Preface



Collected in this volume are the papers presented at the 2nd International
Workshop on Energy Minimization Methods in Computer Vision and Pattern
Recognition (EMMCVPR'99), held at the University of York, England, from July 26
through July 29, 1999. The workshop is the second in what we hope will become
a series. The first meeting was held in Venice in May 1997. The motivation in
starting this series of meetings was the feeling that energy minimization
methods, a topic which has roots in various disciplines such as physics,
statistics, and biomathematics, represent a fundamental methodology in computer
vision and pattern recognition. Although the subject is traditionally well
represented in major international conferences in the fields of computer vision,
pattern recognition, and neural networks, our primary motivation in organizing
this workshop series was to offer researchers the chance to report their work
in a focussed forum that allows for intensive informal discussion.

We received 35 submissions for this workshop. Each paper was reviewed by
three committee members who were asked to comment on the technical quality
of the submissions and provide suggestions for possible improvement. Based on
the comments of the reviewers as well as on time and space constraints we
selected five papers to be delivered as long oral presentations and 17 papers for
regular oral presentation. We make no distinction between these two types of
papers in this book. The book is organized into seven sections on shape,
minimum description length, Markov random fields, contours, search and consistent
labeling, tracking and video, and biomedical applications. We believe that this
topical coverage represents a good snapshot of the state of the art in the subject.

Finally, we must offer thanks to those who have helped us in bringing reality to
the idea of holding this workshop. Firstly, we thank the program committee for
reviewing the papers and providing insightful comments to their authors. We also
gratefully acknowledge the work of the following people who helped in the review
process: J. Clark, H. Deng, N. Duta, F. Ferrie, G. Guo, E. Mémin, P. Pérez, and
L. Wang. Although the workshop was intended to be small we hope that this
book will reach a larger audience. In this respect we are extremely grateful to
Alfred Hofmann at Springer-Verlag who responded positively to our proposal to
publish this volume in the Lecture Notes in Computer Science series. At York
most of the hard work of assembling the reviews has been very professionally
executed by Sara-Jayne Farmer. We also warmly acknowledge the help of Massimo
Bartoli in assembling the final proceedings volume.
A Hamiltonian Approach to the Eikonal Equation

Abstract. The eikonal equation and variants of it are of significant
interest for problems in computer vision and image processing. It is the
basis for continuous versions of mathematical morphology, stereo, shape- 
from-shading and for recent dynamic theories of shape. Its numerical 
simulation can be delicate, owing to the formation of singularities in the 
evolving front, and is typically based on level set methods introduced by 
Osher and Sethian. However, there are more classical approaches rooted 
in Hamiltonian physics, which have received little consideration in the 
computer vision literature. Here the front is interpreted as minimizing 
a particular action functional. In this context, we introduce a new al- 
gorithm for simulating the eikonal equation, which offers a number of 
computational advantages over the earlier methods. In particular, the 
locus of shocks is computed in a robust and efficient manner. We illus- 
trate the approach with several numerical examples. 



1 Introduction 

Variational principles emerged naturally from considerations of energy
minimization in mechanics [11]. We consider these in the context of the eikonal
equation, which arises in geometrical optics and has recently become of great
interest for problems in computer vision [4]. It is the basis for continuous
versions of mathematical morphology [3,16,24,25], as well as for Blum's grassfire
transform [2] and new dynamic theories of shape representation including
[9,22,21]. It has also been widely used for applications in image processing and
analysis [17,5], shape-from-shading [10] and stereo [8].

The numerical simulation of this equation is non-trivial, because it is a
hyperbolic partial differential equation for which a smooth initial front may
develop singularities or shocks as it propagates. At such points, classical
concepts such as the normal to a curve, and its curvature, are not defined.
Nevertheless, it is precisely these points that are important for the above
applications in computer vision since, e.g., it is they which denote the
skeleton (see Figures 4 and 5). To continue the evolution while preserving
shocks, the technology of level set methods introduced by Osher and Sethian [14]
has proved to be extremely powerful. The approach relies on the notion of a weak
solution, developed in viscosity theory [6,12], and the introduction of an
appropriate entropy condition to select it. Care must be taken to use an upwind
scheme to compute derivatives, so that information is not blurred across
singularities. The representation of the evolving front as a level set of a
hypersurface allows topological changes to be handled in a natural way, and
robust, efficient implementations have recently been developed [18].

Level set methods are Eulerian in nature because computations are restricted
to grid points whose locations are fixed. For such methods, the question of
computing the locus of shocks for dynamically changing systems remains of crucial
importance, i.e., the methods are shock preserving but do not explicitly detect
shocks. One approach, such as that taken in [20], is to rely on one-sided
interpolation of the underlying hypersurface between grid points, to provide
sub-pixel estimates of the singularities. Such methods suffer the disadvantage
that the interpolation step is computationally very expensive, and introduces
numerical thresholds for shock detection. Hence, in order to obtain satisfactory
results, high order accurate numerical schemes must be used to simulate the
evolving front [13].

On the other hand, there are more classical methods rooted in Hamiltonian
physics, which can also be used to study shock theory. To the best of our
knowledge, these have not been considered in the computer vision literature.
The purpose of this paper is to introduce these methods and a straightforward
algorithm for simulating the eikonal equation. The approach offers a number of
computational advantages; in particular, the locus of shocks is computed in a
robust and efficient manner. The proposed algorithm is Lagrangian in nature,
i.e., the front is explicitly represented as a sequence of marker particles. The
motion of these particles is then governed by an underlying Hamiltonian system.
Such systems are of course fundamental in classical physics, and the technique
we elucidate for shock tracking therefore has a natural physical interpretation
based on elementary Hamiltonian and Lagrangian mechanics.



2 The Eikonal Equation 



We begin by showing the connection between a monotonically advancing front
and the well-known eikonal equation. Consider the curve evolution equation

    $\frac{\partial C}{\partial t} = F \mathcal{N}$,    (1)

where C is the vector of curve coordinates, $\mathcal{N}$ is the unit inward
normal, and F = F(x, y) is the speed of the front at each point in the plane,
with $F \geq 0$ (the case $F \leq 0$ is also allowed). Let T(x, y) be a graph of
the solution surface, obtained by superimposing all the evolved curves in time
(see Figure 1). In other words, T(x, y) is the time at which the curve crosses a
point (x, y) in the plane. Referring to the figure, the speed of the front is
given by

    $F(x, y) = \frac{d}{h} = \frac{1}{\tan(\alpha)} = \frac{1}{d'} = \frac{1}{\|\nabla T\|}$.

Hence, T(x, y) satisfies the eikonal equation

    $\|\nabla T\| F = 1$.    (2)

Fig. 1. A geometric view of a monotonically advancing front (Eq. 1). T(x, y) is
a graph of the "solution" surface, the level sets of which are the evolved curves.

A number of algorithms have been recently developed to solve a quadratic
form of this equation, i.e., $\|\nabla T\|^2 = 1/F^2$. These include Sethian's
fast marching method [18], which relies on an interpretation of Huygens'
principle to efficiently propagate the solution from the initial curve, and Rouy
and Tourin's viscosity solutions approach [15]. However, neither of these
methods addresses the issue of shock detection explicitly, and more work has to
be done to track shocks.

A different approach, which is related to the solution surface T(x, y) viewed
as a graph, has been proposed by Shah et al. [19,22]. Here the key idea is to
use an edge strength functional v in place of the surface T(x, y), computed by a
linear diffusion equation. The framework provides an approximation to the
reaction-diffusion space introduced in [9]. However, it does not extend to the
extreme cases, i.e., morphological erosion by a disc structuring element
(reaction) or motion by curvature (diffusion). Hence, points of maximum (local)
curvature are interpreted as skeletal points, and the framework provides a type
of regularized skeleton. Its relation to the classical skeleton, obtained from
the eikonal equation with F = 1, is as yet unclear. For example, the curvature
maxima based skeleton may not be connected (see the examples in [19,22]).
Nevertheless, the framework is computationally very efficient since the
governing equation is linear and can be implemented using finite central
differences. Furthermore, it can be applied directly to greyscale images as well
as to curves with triple point junctions.

Fig. 2. Direction of a ray $\dot{q}$ and the direction of motion of the wave
front p. From [1].

In the next section, we shall consider an alternate framework for solving
the eikonal equation, which is based on the canonical equations of Hamilton.
The technique is widely used in classical mechanics, and rests on the use of a
Legendre transformation (see [1] for the precise definition) which takes a system
of n second-order differential equations to a (mathematically equivalent) system
of 2n first-order differential equations. We believe that for a number of vision
problems involving shock tracking and skeletonization, this represents a natural
way of implementing the eikonal equation.
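
As a quick numerical sanity check of Eq. 2, the sketch below evaluates the arrival-time surface of a circular front shrinking at constant speed and confirms that the product of its gradient magnitude and the speed is close to 1. The function names, the circle radius, and the speed value are illustrative assumptions and do not come from the paper.

```python
import math

def arrival_time(x, y, radius=10.0, speed=2.0):
    """Time at which a circle of the given radius, shrinking inward at
    constant speed F, crosses the point (x, y). Hypothetical test surface."""
    return (radius - math.hypot(x, y)) / speed

def gradient_norm(f, x, y, h=1e-5):
    """Central-difference estimate of the gradient magnitude of f at (x, y)."""
    fx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    fy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return math.hypot(fx, fy)

# Residual of the eikonal equation |grad T| * F - 1 at a smooth point.
residual = gradient_norm(arrival_time, 3.0, 4.0) * 2.0 - 1.0
```

The check only makes sense away from the centre of the circle, where T is not differentiable; that point is exactly the kind of singularity (shock) discussed above.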

3 Hamilton’s Canonical Equations 

Following Arnold [1, pp. 248-258], we shall use Huygens' principle to show the
connection between the eikonal equation and the Hamilton-Jacobi equation. For
every point $q_0$, define the function $S_{q_0}(q)$ as the optical length of the
path from $q_0$ to q (see Figure 2). The wave front at time t is given by
$\{q : S_{q_0}(q) = t\}$. The vector $p = \partial S / \partial q$ is called the
vector of normal slowness of the front. By Huygens' principle the direction of
the ray $\dot{q}$ is conjugate to the direction of motion of the front, i.e.,
$p \cdot \dot{q} = 1$. Note that these directions do not coincide in an
anisotropic medium.

Let us specialize to the case of a monotonically advancing front in an
inhomogeneous but isotropic medium (Eq. 1). Here the speed F(x, y) depends only
on position (not on direction), and the directions of p and $\dot{q}$ coincide.
The action function minimized, S(q, t), is defined as

    $S_{q_0, t_0}(q, t) = \int_{\gamma} L \, dt$

along the extremal curve $\gamma$ connecting the points $(q_0, t_0)$ and (q, t).
Here the Lagrangian

    $L = \frac{\|\dot{q}\|}{F(x, y)}$

is a conformal (infinitesimal) length element, and we have assumed that the
extremals emanating from the point $(q_0, t_0)$ do not intersect elsewhere, i.e.,
they form a central field of extremals. Note that for a homogeneous isotropic
medium the extremals are straight lines, and that for the special case
F(x, y) = 1, the action function becomes Euclidean length.

It can be shown that the vector of normal slowness, $p = \partial S / \partial q$,
is not arbitrary but satisfies the Hamilton-Jacobi equation

    $\frac{\partial S}{\partial t} = -H\left(\frac{\partial S}{\partial q}, q\right)$,    (3)

where the Hamiltonian function H(p, q) is the Legendre transformation with
respect to $\dot{q}$ of the Lagrangian function $L(q, \dot{q})$. Rather than
solve the nonlinear Hamilton-Jacobi equation for the action function S (which
will give the solution surface T(x, y) to Eq. 2), it is much more convenient to
look at the evolution of the phase space (p, q) under the equivalent Hamiltonian
system

    $\dot{p} = -\frac{\partial H}{\partial q}, \qquad \dot{q} = \frac{\partial H}{\partial p}$.

This offers a number of advantages, the most significant being that the
equations become linear, and hence trivial to simulate numerically. In the
following we shall derive this system of equations for the special case of a
front advancing with speed F(x, y) = 1.



4 The Hamilton-Jacobi Skeleton Flow

For the case of a front moving with constant speed, recall that the action function
being minimized is Euclidean length, and hence S can be viewed as a Euclidean
distance function from the initial curve $C_0$. Furthermore, the magnitude of its
gradient, $\|\nabla S\|$, is identical to 1 in its smooth regime, which is
precisely where the assumption of a central field of extremals is valid.

With q = (x, y), $p = (S_x, S_y)$, associate to the evolving plane curve
$C \subset \mathbb{R}^2$ the surface $\tilde{C} \subset \mathbb{R}^4$ given by

    $\tilde{C} := \{(x, y, S_x, S_y) : (x, y) \in C, \; S_x^2 + S_y^2 = 1, \; p \cdot \dot{q} = 1\}$.

The Hamiltonian function obtained by applying a Legendre transformation to
the Lagrangian $L = \|\dot{q}\|$ is given by

    $H = p \cdot \dot{q} - L = 1 - (S_x^2 + S_y^2)^{1/2}$.

The associated Hamiltonian system is:

    $\dot{p} = -\frac{\partial H}{\partial q} = (0, 0), \qquad \dot{q} = \frac{\partial H}{\partial p} = -(S_x, S_y)$.    (4)

Evolve $\tilde{C}$ under this system of equations and let
$\tilde{C}(t) \subset \mathbb{R}^4$ denote the resulting (contact) surface. Now
project $\tilde{C}(t)$ to $\mathbb{R}^2$ to get the parallel evolution of C at
time t, C(t).
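
The evolve-then-project recipe can be illustrated on the simplest possible front. The following is a minimal sketch, not the authors' implementation: it integrates Eq. 4 with forward Euler for marker points on a circle and projects the result back to the plane. It assumes a sign convention in which the gradient of S is the outward unit normal, so that the direction $-(S_x, S_y)$ points into the shape; the function name and the sample values are hypothetical.

```python
import math

def evolve_phase_space(points, t, steps=100):
    """Integrate Eq. 4 with forward Euler: p stays fixed, q moves along -p.

    points: list of (x, y, Sx, Sy) tuples in phase space with Sx^2 + Sy^2 = 1.
    Returns the projection of the evolved points onto the (x, y) plane.
    """
    dt = t / steps
    state = [list(pt) for pt in points]
    for _ in range(steps):
        for s in state:
            # dp/dt = (0, 0): the components s[2], s[3] never change.
            s[0] -= dt * s[2]   # dx/dt = -Sx
            s[1] -= dt * s[3]   # dy/dt = -Sy
    return [(s[0], s[1]) for s in state]

# Sixteen markers on a circle of radius 5; (Sx, Sy) is taken to be the
# outward unit normal (assumed sign convention, see the lead-in).
angles = [2 * math.pi * k / 16 for k in range(16)]
circle = [(5 * math.cos(a), 5 * math.sin(a), math.cos(a), math.sin(a))
          for a in angles]
front = evolve_phase_space(circle, t=2.0)
radii = [math.hypot(x, y) for x, y in front]
```

Because the momentum components are constant, each marker travels along a straight line at unit speed, so the projected front at time t = 2 is again a circle, now of radius 3.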

5 Numerical Simulations 








Fig. 3. The original binary shapes used in our experiments range in size from
128x128 to 168x168 pixels.

In this section we apply the above theory to formulate an efficient algorithm
for simulating the eikonal equation, while tracking the shocks which form. Recall
that since the approach is a Lagrangian one, marker particles must first be
placed along the initial curve, which in our simulations is assumed to be a simple
closed curve in the plane.¹ The evolution of marker particles is then governed
by Eq. 4. With q = (x, y), $p = (S_x, S_y) = \nabla S$, the system of equations

    $\dot{S}_x = 0, \quad \dot{S}_y = 0; \qquad \dot{x} = -S_x, \quad \dot{y} = -S_y$

gives a gradient dynamical system. The second pair of equations indicates that
the trajectory of the marker particles will be governed by the vector field
obtained from the gradient of the Euclidean distance function S, and the first
pair indicates that this vector field does not change with time, and can be
computed once at the beginning of the simulation. Projecting this 4D system onto
the (x, y) plane for each instance of time t will give the evolved curve C(t).

¹ The method also extends naturally to a set of open curves by interpreting S as an
outward distance function from the collection of curve segments. The initial marker
particles are placed on the boundaries of an infinitesimal dilation of each open curve,
and are then evolved in an outward direction.

In order to obtain accurate results, three numerical issues need to be
addressed. First, in order to obtain a dense sequence of marker particles, a
continuous representation of the initial shape's boundary (T(x, y) = 0, see
Figure 1) is needed. Second, it is possible for marker particles to drift apart
in rarefaction regions, i.e., concave portions of the curve may fan out. Hence,
new marker particles must be interpolated when necessary. Third, whereas finite
central differences are adequate for estimating the gradient of the Euclidean
distance function in its smooth regime, such estimates will lead to errors near
singularities, where S is not differentiable. Hence, we use ENO interpolants for
estimating derivatives [13]; the key idea is to obtain information from the
"smooth" side, in the vicinity of a singularity. The algorithm may now be stated
as follows:

1. Take as the initial curve T(x(s), y(s)) = 0 the given boundary of an
   object, assumed to be a simple closed curve in the plane.

2. Create an ordered sequence of marker particles at positions $\Delta s$ apart
   along the boundary.

3. Compute a Euclidean distance transform, where each grid point in
   the interior of the boundary is assigned its Euclidean distance to the
   closest marker particle.

4. For each grid point in the interior of the boundary compute and store
   the components of the vector field $\nabla S$, using ENO interpolants.

5. Do for step from 0 to TOTALSTEPS {

     Do for particle from 0 to NPARTICLES {

       - Update the particle's position based on
         $\nabla S$ at the closest grid point:
           x(step + 1) = x(step) - $\Delta t \cdot S_x$
           y(step + 1) = y(step) - $\Delta t \cdot S_y$

       - if (Distance(particle, next_particle) > spacing criterion)
           interpolate a new particle in between.

     }

   }
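
The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used for the experiments: the shape is a disc whose distance function is known analytically, S is taken negative inside the shape so that $-\nabla S$ points inward, plain central differences stand in for the ENO interpolants, and new particles are placed at linear midpoints rather than by circular-arc interpolation. All function and parameter names are hypothetical.

```python
import math

def signed_distance_grid(size, cx, cy, radius):
    """S[row][col]: distance to a circle's boundary, negative inside
    (assumed sign convention, so that -grad S points into the shape)."""
    return [[math.hypot(c - cx, r - cy) - radius for c in range(size)]
            for r in range(size)]

def central_gradient(S):
    """Central-difference gradient at interior grid points (stand-in for ENO)."""
    n = len(S)
    Sx = [[0.0] * n for _ in range(n)]
    Sy = [[0.0] * n for _ in range(n)]
    for r in range(1, n - 1):
        for c in range(1, n - 1):
            Sx[r][c] = 0.5 * (S[r][c + 1] - S[r][c - 1])
            Sy[r][c] = 0.5 * (S[r + 1][c] - S[r - 1][c])
    return Sx, Sy

def evolve_markers(particles, Sx, Sy, dt, total_steps, max_gap):
    """Step 5: move each particle by -dt * grad S at the closest grid point,
    then insert a midpoint wherever neighbouring particles drift apart."""
    n = len(Sx)
    pts = list(particles)
    for _ in range(total_steps):
        moved = []
        for x, y in pts:
            c = min(max(int(round(x)), 0), n - 1)
            r = min(max(int(round(y)), 0), n - 1)
            moved.append((x - dt * Sx[r][c], y - dt * Sy[r][c]))
        pts = []
        for i, (x, y) in enumerate(moved):
            nx, ny = moved[(i + 1) % len(moved)]   # closed curve: wrap around
            pts.append((x, y))
            if math.hypot(nx - x, ny - y) > max_gap:
                pts.append((0.5 * (x + nx), 0.5 * (y + ny)))
    return pts

# Disc of radius 20 centred in a 65 x 65 grid, markers about 0.25 pixels apart.
S = signed_distance_grid(65, 32.0, 32.0, 20.0)
Sx, Sy = central_gradient(S)
count = 503
start = [(32 + 20 * math.cos(2 * math.pi * k / count),
          32 + 20 * math.sin(2 * math.pi * k / count)) for k in range(count)]
front = evolve_markers(start, Sx, Sy, dt=0.5, total_steps=10, max_gap=0.75)
radii = [math.hypot(x - 32, y - 32) for x, y in front]
```

After ten steps of 0.5 pixels the markers should lie close to a circle of radius 15. The sketch stops well before the particles reach the centre of the disc, where the gradient estimate is unreliable and shocks form.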

The original binary shapes used in our experiments are depicted in
Figure 3. The simulations are based on a piecewise circular arc representation of
the boundary, obtained using the contour tracer developed in [20] on the signed
distance transform of the original shape. Prior to obtaining the contour, the
distance transform is Gaussian blurred very slightly ($\sigma = 0.5$ pixels) to
combat discretization. The birth of new marker particles (step 5) is also based
on circular arc interpolation. Figures 4 and 5 depict the evolution of marker
particles, with speed F = 1, for several different shapes. For all simulations,
the spacing $\Delta s$ of initial marker particles is 0.25 pixels, the spacing
criterion for interpolating a new particle in the course of the evolution is
0.75 pixels, and the resolution of the Euclidean distance transform S is the
same as that of the original binary image. The timestep $\Delta t$ is 0.5
pixels, and results for every second iteration are saved. The superposition of
all the level curves gives the solution surface T(x, y) in Figure 1. It is
important to note that in principle higher order interpolants can be used for
the placement of marker particles, and the resolution of the exact distance
transform is not limited by that of the original binary shape.

Fig. 4. The evolution of marker particles under the Hamiltonian system. The
initial particles are placed on the boundary, and iterations of the process are
superimposed. These correspond to level sets of the solution surface T(x, y) in
Figure 1. Individual marker particles are more clearly visible in the zoom-in on
the fingers of the hand (top right). See Section 5 for a discussion.

Fig. 5. The evolution of marker particles under the Hamiltonian system. The
initial particles are placed on the boundary, and iterations of the process are
superimposed. Each iteration gives a level set of the solution surface T(x, y) in
Figure 1. See Section 5 for a discussion.

The results are comparable to those obtained using higher order ENO
implementations, although the algorithm is computationally more efficient (linear
in the number of marker particles). Informal timing experiments indicate that the
efficiency of the algorithm exceeds that of level set methods, except under the
"fast marching" implementation, with which it compares favorably. However,
when shock detection is included, the Hamiltonian approach has important
conceptual and computational advantages. In particular, in contrast with level
set approaches, topological splits are not explicitly handled, but shocks
(collisions of marker particles) are. In effect, the marker particles are
jittered back and forth along the crest lines of the distance function S,
leading to thick traces.

The above simulation of the eikonal equation has a variety of applications
in computer vision [4,9,22,21,10,8], mathematical morphology [3,16,24,25], and
image processing and analysis [17,5]. If desired, it is also possible to formulate an
explicit stopping condition for the marker particles. The key idea is to consider
the net outward flux per unit volume of the vector field underlying the
Hamiltonian system, and to detect locations where energy is lost. As a by-product
of this analysis, which will be described in future work, the skeleton can be
robustly and efficiently obtained using only local parallel computations.
Figure 6 illustrates the potential of this method, on the same set of shapes.
These results may be interpreted as a "stopping potential"; as marker particles
enter the regime of negative flux (shown in white) they can be extinguished.

6 Conclusions 

In this paper we have introduced a new algorithm for simulating the eikonal
equation. The method is rooted in Hamiltonian physics and offers a number of
computational advantages when it comes to shock tracking. In future work we
plan to extend our results to 3D, where the underlying Hamiltonian system will
have the same structure, and the same divergence analysis can be carried out.
However, the placement and interpolation of marker particles on the propagating
surface will be more delicate. In closing, we note that in related recent work,
vector fields rooted in magneto-statics have been used for extracting symmetry
and edge lines in greyscale images [7], and that a wave propagation framework
on a discrete grid has been proposed for curve evolution and mathematical
morphology [23].

Fig. 6. A divergence-based skeleton, superimposed in white on the original
binary shapes (shown in grey). Comparing with the Hamiltonian system based
flows in Figures 4 and 5, these maps can be used to formulate an explicit
stopping condition for the individual marker particles.
Abstract. This paper demonstrates how a new shape-from-shading
scheme can be used to extract topographic information from 2D intensity
imagery. The shape-from-shading scheme has two novel ingredients.
Firstly, it uses a geometric update procedure which allows the image
irradiance equation to be satisfied as a hard constraint. This not only
improves the data-closeness of the recovered needle-map, but also removes
the necessity for extensive parameter tuning. Secondly, we use curvature
information to impose topographic constraints on the recovered needle-map.
The topographic information is captured using the shape-index of
Koenderink and van Doorn [14] and consistency is imposed using a robust
error function. We show that the new shape-from-shading scheme
leads to a meaningful topographic labelling of 3D surface structures.



1 Introduction 

Marr identified shape-from-shading (SFS) as providing one of the key routes to
understanding 3D surface structure via the 2½D sketch [17]. The process has
been an active area of research for over two decades. It is concerned with
recovering 3D surface-shape from shading patterns. The subject has been tackled
in a variety of ways since the pioneering work of Horn and his co-workers in the
1970s [8,13].

The classical approach to shape-from-shading is couched as an energy min- 
imisation process using the apparatus of variational calculus [13,9]. Here the aim 
is to iteratively recover a needle-map representing local surface orientation by 
minimising an error-functional. The functional contains a data-closeness term, 
and a regularizing term that controls the smoothness of the recovered needle- 
map. Since the recovery of the needle-map is under-constrained, the variational 
equations must be augmented with boundary constraints. 

Despite considerable progress in the recovery of needle-maps using shape- 
from-shading [2,4,21], there are few examples of the use of the method for 3D 
surface analysis and recognition from 2D imagery [1]. One of the reasons for this 
is the lack of surface detail contained in the needle-map. This can be attributed to 
the fact that most shape from shading schemes over-smooth the recovered needle 
map. This is a disappointing omission since there is psychophysical evidence that 
shading information is a useful shape cue [15]. 
We have recently embarked on a programme of work aimed at using
shape-from-shading as a means of 3D surface analysis and recognition. The
contributions to date have been two-fold. We have commenced by bringing the
apparatus of robust statistics to bear on the problem of needle-map recovery
[22,23]. By using robust error kernels to model the smoothness of the needle
map, we have limited some of the problems of over-smoothing of surface detail.
Our second contribution has been to show how the extracted needle-map
information can be used for view-based object recognition [24]. Here
shape-from-shading has been shown to deliver usable information for a simple
histogram-based recognition scheme.



Encouraged by these first results, we are currently investigating how more 
sophisticated surface representations can be elicited from 2D intensity imagery 
using shape-from-shading. In particular, we would like to capture the differen- 
tial or topographic structure of surfaces. Although this is a routine procedure 
in range imagery [5], there has been little effort directed at extracting topo- 
graphic structure using shape-from-shading. One notable exception is the work 
of Lagarde and Ferrie [6]. Here the curvature consistency process of Sander
and Zucker [20] is applied to the needle map as a post-processing step so as to 
improve the organisation of the field of principal curvature directions. There is 
no attempt to exploit curvature consistency constraints in the recovery of needle 
maps via the image irradiance equation. 



To meet the goal of recovering topographic information, we present a new
shape-from-shading algorithm. The algorithm is based on a geometric
interpretation of the ambiguity structure of the image irradiance equation in the
under-constrained conditions that apply in shape-from-shading. At each image
location, the available intensity information together with the physics of the
image irradiance equation mean that the recovered surface normal must fall on a
cone of ambiguity. The axis of the cone points in the light source direction. We
can impose organisation on neighbouring surface normals by allowing them to
rotate on their respective cones of ambiguity so as to satisfy consistency
constraints. Here we impose the neighbourhood organisation constraints in such a
way as to encourage curvature consistency. Our modelling of curvature
consistency is based around Koenderink and van Doorn's shape index [14]. This is
a scale-invariant measure which captures the different topographic classes using
a continuous angular variable. Using the shape index allows surfaces to be
segmented into meaningful topographic structures such as ridges or valleys,
saddle points or lines, and domes or cups. These structures can be further
organised into simply connected elliptical or hyperbolic regions which are
separated from one another by parabolic lines. Our consistency model uses robust
error kernels to model acceptable local variations in the shape-index. The model
encourages parabolic structures (i.e. ridges and ravines) to be thin and
contour-like. Hyperbolic and elliptical structures (domes, cups etc.) are
encouraged to form contiguous regions.
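
The cone of ambiguity can be made concrete with a short sketch. It assumes the normalised Lambertian model used later in the paper, in which the brightness E equals the dot product of the unit surface normal and the unit light source direction, so that every admissible normal makes the angle arccos(E) with the light direction. The helper names and the way the rotation angle phi is parameterised are illustrative choices, not taken from the paper.

```python
import math

def cross(a, b):
    """Vector cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def normal_on_cone(E, s, phi):
    """Unit normal at angle arccos(E) from light direction s, rotated by
    phi about s. Every phi gives a normal with the same brightness E."""
    s = unit(s)
    theta = math.acos(max(-1.0, min(1.0, E)))
    # Any vector not parallel to s will do to build a perpendicular basis.
    helper = (1.0, 0.0, 0.0) if abs(s[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = unit(cross(s, helper))
    v = cross(s, u)
    return tuple(math.cos(theta) * s[k]
                 + math.sin(theta) * (math.cos(phi) * u[k] + math.sin(phi) * v[k])
                 for k in range(3))

light = unit((1.0, 1.0, 2.0))
candidates = [normal_on_cone(0.6, light, phi) for phi in (0.0, 1.0, 2.5)]
```

Rotating a normal on its cone therefore changes its direction, and hence its agreement with neighbouring normals, without changing the brightness it predicts.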
2 The Variational Approach to SFS

Central to shape-from-shading is the idea that local regions in an image E(x, y)
correspond to illuminated patches of a piecewise continuous surface, z(x, y). The
measured brightness E(x, y) will vary depending on the material properties of
the surface (whether matte or specular), the orientation of the surface at the
co-ordinates (x, y), and the direction of illumination.

The reflectance map, R(p, q), characterises these properties, and provides an
explicit connection between the image and the surface orientation. The surface
orientation is characterised by the components of the surface gradient in the x
and y direction, i.e. $p = \partial z / \partial x$ and
$q = \partial z / \partial y$. The shape-from-shading problem is to
recover the surface z(x, y) from the image E(x, y). As an intermediate step, we
may attempt to recover a set of surface normals or needle-map, describing the
orientations of surface patches which locally approximate z(x, y).

To simplify the problem, most research has concentrated on recovering ideal
Lambertian surfaces illuminated by a single point source located at infinity [3]. A
Lambertian surface has a matte appearance and reflects incident light uniformly
in all directions. Hence, the light reflected by a surface patch in the direction of
the viewer is simply proportional to the cosine of the angle between the patch
normal and the light source direction. If $n = (-p, -q, 1)^T$ is the local unit
surface normal, and $s = (-p_l, -q_l, 1)^T$ the global light source direction,
then the reflectance function is given by $R(p, q) = n \cdot s$.

The image irradiance equation states that the measured brightness of the
image is proportional to the radiance at the corresponding point on the surface,
which is R(p, q). Normalising both image intensity and reflectance map, the
constant of proportionality becomes unity, and the image irradiance equation is
simply E(x, y) = R(p, q).
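
A minimal sketch of the Lambertian reflectance map follows. It assumes that both the normal and the light source vector are normalised to unit length before the dot product is taken, and it clamps negative values to zero to represent self-shadowed patches; the clamp and the function names are additions made for illustration and are not part of the formulation above.

```python
import math

def unit(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)

def lambertian_reflectance(p, q, p_l, q_l):
    """R(p, q) = n . s with n along (-p, -q, 1) and s along (-p_l, -q_l, 1).

    Both vectors are normalised here (an assumption), and patches facing
    away from the light are clamped to zero brightness (also an assumption).
    """
    n = unit((-p, -q, 1.0))
    s = unit((-p_l, -q_l, 1.0))
    return max(0.0, sum(a * b for a, b in zip(n, s)))

# A fronto-parallel patch lit head-on is fully bright; tilting it by
# 45 degrees about the y axis reduces the brightness to cos(45 degrees).
flat = lambertian_reflectance(0.0, 0.0, 0.0, 0.0)
tilted = lambertian_reflectance(1.0, 0.0, 0.0, 0.0)
```

Under the image irradiance equation, these values are the brightnesses E(x, y) that such patches would produce in a normalised image.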



2.1 Horn and Brooks Algorithm 

The image irradiance equation succinctly describes the mapping between the x, y
co-ordinate space of the image and the p, q gradient-space of the surface, but
provides insufficient constraints for the unique recovery of the needle-map.
Additional constraints, based on assumptions about the structure of the
recovered surface, must be utilised. Invariably, it is smoothness of the
needle-map that is assumed. Hence, the goal is to recover the smoothest surface
satisfying the image irradiance equation. This is posed as a variational problem
in which a global error-functional is minimized through the iterative adjustment
of the needle-map. Here we consider the formulation of Brooks and Horn [3],
which is couched in terms of unit surface normals. The Horn and Brooks error
functional is defined to be

    $I = \iint \Big\{ \underbrace{\big(E(x, y) - n \cdot s\big)^2}_{\text{Brightness Error}} + \underbrace{\lambda \Big( \Big\|\frac{\partial n}{\partial x}\Big\|^2 + \Big\|\frac{\partial n}{\partial y}\Big\|^2 \Big)}_{\text{Regularizing Term}} + \mu \big( \|n\|^2 - 1 \big) \Big\} \, dx \, dy$    (1)
The functional has three distinct terms. Firstly, the brightness error encourages
data-closeness of the measured image intensity and the reflectance function. It is
the only term which directly exploits shading information. The regularizing term
imposes the smoothness constraint on the recovered surface normals; it penalises
large local changes in surface orientation, measured by the magnitudes of the
partial derivatives of the surface normals in the x and y directions. The final
term imposes normalization constraints on the recovered normals. The constants
$\mu$ and $\lambda$ are Lagrange multipliers.

The functional is minimized by applying variational calculus and solving the 
Euler equation: 

(^E — n • s + AV^n — /in = 0 (2) 

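To obtain a numerical scheme for recovering the needle-map we must discretise 
this variational equation to the pixel lattice by indexing the surface normals 
according to their co-ordinates (i,j) on the pixel-lattice. With this notation, the 
discrete numerical approximation to the Laplacian is 

\[ \nabla^2\mathbf{n}_{i,j} \approx \frac{4}{\epsilon^2}\left(\bar{\mathbf{n}}_{i,j} - \mathbf{n}_{i,j}\right) \qquad (3) \]

where 

\[ \bar{\mathbf{n}}_{i,j} = \frac{1}{4}\left(\mathbf{n}_{i+1,j} + \mathbf{n}_{i-1,j} + \mathbf{n}_{i,j+1} + \mathbf{n}_{i,j-1}\right) \qquad (4) \]

is the average normal over the local 4-neighbourhood and $\epsilon$ is the spacing of 
pixel-sites on the lattice. Upon substitution, the Euler equation becomes 

\[ (E_{i,j} - \mathbf{n}_{i,j}\cdot\mathbf{s})\,\mathbf{s} + \frac{4\lambda}{\epsilon^2}\left(\bar{\mathbf{n}}_{i,j} - \mathbf{n}_{i,j}\right) - \mu_{i,j}\mathbf{n}_{i,j} = 0 \qquad (5) \]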



Rearranging this equation to isolate $\mathbf{n}_{i,j}$ yields the following fixed-point iterative 
scheme for updating the estimated normal at the surface point corresponding to 
image pixel (i, j), at epoch k + 1, using the previously available estimate from 
epoch k: 

\[ \mathbf{n}^{(k+1)}_{i,j} = \frac{1}{1 + \mu_{i,j}\,\epsilon^2/(4\lambda)} \left[ \bar{\mathbf{n}}^{(k)}_{i,j} + \frac{\epsilon^2}{4\lambda}\left(E_{i,j} - \mathbf{n}^{(k)}_{i,j}\cdot\mathbf{s}\right)\mathbf{s} \right] \qquad (6) \]

At first sight, it appears necessary to solve for the Lagrangian multiplier $\mu_{i,j}$ 
on a pixel-by-pixel basis. However, it is important to note that $\mu_{i,j}$ only enters 
the update equation as a multiplying factor which does not affect the direction 
of update, so we can replace this factor by a normalization step. Finally, we 
comment on the geometry of the Horn and Brooks needle-map update equation. 
It is clear that there are two distinct components. The first of these is in the 
direction of the average neighbourhood normal $\bar{\mathbf{n}}_{i,j}$. This component has a local 
smoothing effect. The second component is in the direction of the light-source 
direction $\mathbf{s}$. This can be viewed as responding to the physics of the image 
irradiance equation and has step-size proportional to $E_{i,j} - \mathbf{n}_{i,j}\cdot\mathbf{s}$. The relative 
step-sizes are controlled by the Lagrange multiplier. 

The principal criticism of the Horn and Brooks algorithm and similar 
approaches is the tendency to over-smooth the recovered needle-map. Specifically, 
the smoothness term dominates the data term. Since the smoothness constraint 
is formulated in terms of the directional derivatives of the needle-map, it is 
trivially minimised by a flat surface. Thus, the conflict between the data and 
the model leads to a strongly smoothed needle-map and the loss of fine-detail. 
The problem is exacerbated by the need to select a conservative value for the 
Lagrange multiplier in order to ensure numerical stability [11]. 

Horn [11] attempts to reduce the model dominance problem by annealing 
the Lagrange multiplier as a final solution is approached. Meanwhile, we have 
used the apparatus of robust statistics to moderate the penalization of 
discontinuities [22]. 
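
The per-pixel Horn and Brooks update described in this section can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `horn_brooks_update` and its arguments are hypothetical, the step-size is assumed to be $\epsilon^2/(4\lambda)$, and the multiplier $\mu_{i,j}$ is replaced by a renormalization step as the text suggests.

```python
import math


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def horn_brooks_update(n, neighbours, E, s, lam, eps=1.0):
    """One fixed-point update of the normal at a single pixel (illustrative).

    n          -- current unit normal at the pixel
    neighbours -- normals of the 4-neighbourhood
    E          -- normalized image brightness at the pixel
    s          -- unit light source direction
    lam        -- smoothness weight (lambda); eps is the pixel spacing
    """
    # Smoothing component: average neighbourhood normal.
    n_bar = tuple(sum(m[k] for m in neighbours) / len(neighbours) for k in range(3))
    # Data component: step along the light source direction.
    alpha = (eps ** 2) / (4.0 * lam) * (E - dot(n, s))
    updated = tuple(n_bar[k] + alpha * s[k] for k in range(3))
    # Renormalizing stands in for solving for the multiplier mu.
    return normalize(updated)
```

Because the result is renormalized, nothing forces it to satisfy $\mathbf{n}\cdot\mathbf{s} = E$, which is the behaviour the criticism above refers to.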

3 A Novel Framework for SFS 

The idea underpinning our new framework for shape-from-shading is to guarantee 
data-closeness by treating the image irradiance equation as a hard constraint. 
In other words we aim to recover a valid needle-map which satisfies the image 
irradiance equation at every iteration. Subject to this data-closeness constraint, 
the task of shape-from-shading becomes that of iteratively improving the 
needle-map estimate. Here, we do this using curvature consistency constraints. 

Our approach is a geometric one. We view the image irradiance equation as 
defining a cone of ambiguity about the light source direction for each surface 
normal. The individual surface normals which constitute the needle-map can only 
assume directions that fall on this cone. At each iteration the updated normal is 
free to move away from the cone under the action of the local consistency 
constraints. However, it is subsequently mapped back onto the closest normal 
residing on the cone. By applying this constraint, we gain dual advantages in 
terms of both numerical stability and obviating the need for a Lagrange 
multiplier. More importantly, the needle-map evolves via a series of intermediate 
states which are each solutions of the image irradiance equation. 

3.1 Hard Constraints 

The new framework requires us to minimize the constraint functional 

\[ I_C = \iint \psi\big(\mathbf{n}(x,y), N(x,y)\big) \, dx \, dy \qquad (7) \]

whilst satisfying the hard constraint imposed by the image irradiance equation 

\[ E(x,y) - \mathbf{n}(x,y)\cdot\mathbf{s} = 0 \qquad (8) \]

Here N(x,y) is the set of local neighbourhood vectors about location (x,y). 
For example, in terms of lattice coordinates i, j, the 4-neighbourhood of $\mathbf{n}_{i,j}$ is 
defined as 

\[ N_{i,j} = \{ \mathbf{n}_{i+1,j},\ \mathbf{n}_{i-1,j},\ \mathbf{n}_{i,j+1},\ \mathbf{n}_{i,j-1} \} \qquad (9) \]

The function $\psi(\mathbf{n}(x,y), N(x,y))$ is a localized function of the current surface 
normal estimates. The size of the neighbourhood may be varied according to the 
nature of $\psi$. Clearly, it is possible to incorporate the hard data-closeness 
constraint directly into $\psi$, but this needlessly complicates the mathematics. Instead, 
we choose to impose the constraint after each iteration by mapping the updated 
normals back to the most similar normal lying on the cone. 

The resulting update equation for the surface normals can be written as 

\[ \mathbf{n}^{(k+1)}_{i,j} = \Theta\,\hat{\mathbf{n}}^{(k)}_{i,j} \qquad (10) \]

where $\hat{\mathbf{n}}^{(k)}_{i,j}$ is the surface normal that minimises the constraint functional $I_C$. 
The hard image irradiance constraint is imposed by the rotation matrix $\Theta$ which 
maps the updated normal to the closest normal lying on the cone of ambiguity. 
Another way to look at this is that we allow the smoothness constraint to select 
the direction of the normal estimate in the image plane only, whilst fixing the 
angle between the normal estimate and the light source direction. 

To achieve the rotation, we define an axis perpendicular to the intermediate 
update normal and the light source direction. The axis of rotation is found 
by taking the cross-product of the intermediate update with the light source 
direction, $(u, v, w)^T = \hat{\mathbf{n}}^{(k)}_{i,j} \times \mathbf{s}$. The angle of rotation is the difference between 
the angle subtended by the intermediate update and the light source, and the 
apex angle of the cone of ambiguity. Since the image is normalized, the latter 
angle is simply $\cos^{-1} E$, giving a rotation angle of 

\[ \theta = -\cos^{-1}\left( \frac{\hat{\mathbf{n}}^{(k)}_{i,j}\cdot\mathbf{s}}{\|\hat{\mathbf{n}}^{(k)}_{i,j}\|\,\|\mathbf{s}\|} \right) + \cos^{-1} E \]

Hence, the rotation matrix is given by 

\[ \Theta = \begin{pmatrix}
c + u^2 c' & -ws + uvc' & vs + uwc' \\
ws + uvc' & c + v^2 c' & -us + vwc' \\
-vs + uwc' & us + vwc' & c + w^2 c'
\end{pmatrix} \]

where $c = \cos\theta$, $c' = 1 - c$, and $s = \sin\theta$. Figure 1 illustrates the update 
process, and compares it with the Horn and Brooks update equation. 
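
The mapping back onto the cone of ambiguity can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names (`rotation_matrix`, `map_to_cone`) are hypothetical, the axis is normalized before use, and the sign of the angle is chosen for a right-handed rotation about $\hat{\mathbf{n}} \times \mathbf{s}$ (the opposite sign pairs with the reversed axis). A fallback axis is picked when the normal is parallel to the light source, a case the text does not discuss.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(c / length for c in v)


def rotation_matrix(axis, theta):
    """Axis-angle rotation matrix with the same entries as the matrix in the text."""
    u, v, w = normalize(axis)
    c, s = math.cos(theta), math.sin(theta)
    c1 = 1.0 - c
    return ((c + u * u * c1, -w * s + u * v * c1, v * s + u * w * c1),
            (w * s + u * v * c1, c + v * v * c1, -u * s + v * w * c1),
            (-v * s + u * w * c1, u * s + v * w * c1, c + w * w * c1))


def map_to_cone(n_hat, light, E):
    """Rotate n_hat onto the cone of normals satisfying n . light = E."""
    n_hat, light = normalize(n_hat), normalize(light)
    axis = cross(n_hat, light)
    if dot(axis, axis) < 1e-24:
        # Normal (anti)parallel to the light source: any perpendicular axis will do.
        helper = (1.0, 0.0, 0.0) if abs(light[0]) < 0.9 else (0.0, 1.0, 0.0)
        axis = cross(light, helper)
    current = math.acos(max(-1.0, min(1.0, dot(n_hat, light))))
    apex = math.acos(max(-1.0, min(1.0, E)))
    # Positive angles rotate n_hat towards the light source about n_hat x light.
    theta = current - apex
    R = rotation_matrix(axis, theta)
    return tuple(dot(row, n_hat) for row in R)
```

For example, with the light source along the z-axis and E = 0.5, a normal at 45 degrees to the light source is rotated away from it until the angle is 60 degrees, while staying in the plane spanned by the normal and the light source.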



3.2 Initialization 

The new framework requires an initialization which ensures that the image ir- 
radiance equation is satisfied. This differs from the Horn and Brooks algorithm, 
which is usually initialized by estimating the occluding boundary normals, with 
all other normals set to point in the light source direction. 

To satisfy the image irradiance equation, we must choose a normal direction 
from the infinite possibilities defined by the cone of ambiguity. 

Fig. 1. Comparison of update process for Horn and Brooks (left) and the new 
framework with the same smoothness constraint (right). Horn and Brooks allows 
the updated normal to move away from the cone of ambiguity, sacrificing 
brightness error for smoothness. The movement in the light source direction is 
$\alpha\mathbf{s}$ in the left-hand diagram, where $\alpha = \frac{\epsilon^2}{4\lambda}(E - \mathbf{n}\cdot\mathbf{s})$ from the Horn and Brooks 
update equation. In contrast, the new framework forces the brightness error to 
be satisfied by using the rotation matrix $\Theta$ to map the smoothness update to 
the closest point on the cone. 

We choose to initialize each normal such that its projection onto the image 
plane lies in the opposite direction to the image gradient direction, as shown 
in Figure 2. This results in an initialization with an implicit bias towards 
convex rather than concave surfaces. In other words, bright regions are assumed 
to correspond to peaks, and the image gradient direction points towards these 
peaks. 
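
As a rough sketch of this initialization, the normal can be built directly from the brightness and the image gradient. The function name `initial_normal` is hypothetical, and the sketch assumes the special case of a light source along the viewing direction, s = (0, 0, 1); for a general light source the cone would have to be parameterized about s instead. The choice made when the gradient vanishes is arbitrary and is not taken from the text.

```python
import math


def initial_normal(E, gx, gy):
    """Normal on the cone n . s = E whose image-plane projection opposes the gradient.

    Assumes s = (0, 0, 1), so the z-component of the unit normal equals E.
    (gx, gy) is the image intensity gradient at the pixel.
    """
    E = max(0.0, min(1.0, E))
    sin_apex = math.sqrt(1.0 - E * E)  # length of the image-plane projection
    g = math.hypot(gx, gy)
    if g == 0.0:
        # No gradient information: the in-plane direction is arbitrary (assumption).
        return (sin_apex, 0.0, E)
    return (-sin_apex * gx / g, -sin_apex * gy / g, E)
```

Every normal produced this way lies on the cone by construction, so the needle-map satisfies the image irradiance equation before the first iteration.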



Fig. 2. The set of surface normals at a point which satisfy the image irradiance 
equation define a cone such that $E - \mathbf{n}\cdot\mathbf{s} = 0$. A normal from this set is chosen such that 
the direction of its projection to the image plane is opposite to the maximum intensity 
gradient direction, g. 



We have also applied this initialization to the Horn and Brooks algorithm 
in place of the traditional occluding boundary initialization. We find that this 
initialization produces significantly better and faster results for the Horn and 
Brooks algorithm. However, in common with the Horn and Brooks algorithm, 
our schemes are sensitive to initialization. It is impossible to say whether using 
the image gradient as described above is the best possible initialization, but it 
does have stable properties and is, intuitively, a reasonable method of estimating 
initial normal direction. 



4 Needle Map Smoothness Constraints 



Before we describe our modelling of curvature consistency, we pause to consider 
how needle-map smoothness can be incorporated into our new framework for 
shape-from-shading. In a recent paper [23], we showed how needle-map smooth- 
ness could be modelled using robust error kernels. Here the adopted framework 
was based on a regularised energy-function similar to that underpinning the Horn 
and Brooks algorithm. However, rather than using a quadratic smoothness prior, 
we used a continuous variant of the Huber robust error kernel. In this section we 
show how this smoothness model can be used in conjunction with our geometric 
needle-map update process. In essence, we consider that the recovered surface 
should be smooth, except where there is a high probability that a discontinuity 
is present, in which case the smoothing is reduced. 

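We define the robust regularizer constraint function as 

\[ \psi(\mathbf{n}, N) = \rho_\sigma\left( \left\| \frac{\partial \mathbf{n}}{\partial x} \right\| \right) + \rho_\sigma\left( \left\| \frac{\partial \mathbf{n}}{\partial y} \right\| \right) \qquad (12) \]

where $\rho_\sigma(\eta)$ is a robust kernel defined on the residual $\eta$ and with width parameter 
$\sigma$. Applying the calculus of variations to the resulting constraint function $I_C$ 
yields the general update equation 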




In [23] we experimented with several robust error kernels, including Li's 
Adaptive Potential Functions [16], and the Tukey [7] and Huber [12] estimators. 
However, the sigmoidal-derivative M-estimator, a continuous version of 
Huber's estimator, proved to possess the best properties for handling surface 
discontinuities, and is defined to be 

\[ \rho_\sigma(\eta) = \frac{\sigma}{\pi} \log\cosh\left( \frac{\pi\eta}{\sigma} \right) \qquad (14) \]

Substituting Equation 14 into Equation 13 yields the update equation 

(15) 

It is illuminating to consider the behaviour of this update equation for small 
and large smoothness errors. Firstly, the averaging of the neighbourhood normals 
is moderated by a function of the form $\tanh(\pi\eta/\sigma)$. This averaging effect is 
most pronounced when the smoothness error is small, i.e. when the surface is 
already approximately flat. The other contribution to the smoothness process 
is of the form $\eta^{-2}\left( \frac{\pi\eta}{\sigma}\,\mathrm{sech}^2\frac{\pi\eta}{\sigma} - \tanh\frac{\pi\eta}{\sigma} \right)$. This term vanishes at the origin 
and tends towards zero for large values of $\eta$, only kicking in at intermediate 
error conditions. Thus, there is no smoothing pressure from either process upon 
strong discontinuities. 

The robust regularizer approach provides significantly improved results over 
the simple Horn and Brooks smoothness constraint, but at the expense of 
introducing the parameter $\sigma$. 
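
The behaviour of the kernel for small and large residuals can be checked numerically. The sketch below assumes the log-cosh form $\rho_\sigma(\eta) = (\sigma/\pi)\log\cosh(\pi\eta/\sigma)$, whose derivative is $\tanh(\pi\eta/\sigma)$; the function names `rho` and `rho_prime` are hypothetical.

```python
import math


def rho(eta, sigma):
    """Assumed kernel (sigma/pi) * log cosh(pi * eta / sigma)."""
    x = math.pi * abs(eta) / sigma
    # log cosh(x) = x + log(1 + exp(-2x)) - log 2, which avoids overflow for large x.
    return (sigma / math.pi) * (x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0))


def rho_prime(eta, sigma):
    """Derivative of the kernel with respect to the residual."""
    return math.tanh(math.pi * eta / sigma)
```

For residuals well below $\sigma$ the kernel is approximately quadratic, as in the Horn and Brooks regularizer, while for residuals well above $\sigma$ it grows only linearly and its derivative saturates at one, so large discontinuities are penalised far less severely than under a quadratic prior.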



5 Curvature Consistency 

Needle-map smoothness appears to be an over-strong and inappropriate con- 
straint for shape from shading. This is primarily because real surfaces are more 
likely to be piecewise smooth; in other words, formed of smooth regions separated 
by sharp discontinuities in depth or orientation. The over-smoothing problem is 
exacerbated by the difficulty of formulating the continuous concept of smooth- 
ness on a discrete pixel lattice, as clearly illustrated by the fact that the Horn 
and Brooks smoothness constraint is trivially minimized by a needle-map corre- 
sponding to a planar surface. 

Here we take a different tack by using curvature consistency. Although the 
curvature classes either side of a depth discontinuity may be completely unre- 
lated, this is not the case for an orientation discontinuity. Orientation disconti- 
nuities usually correspond to ruts or ridges. Furthermore, the curvature classes 
for locations either side of a rut or a ridge should be the most similar classes, 
either trough or saddle rut for a rut, or dome or saddle ridge for a ridge. This 
property of smooth variation in class suggests that curvature consistency may be 
a more appropriate constraint for SFS than smoothness, which strongly penalises 
legitimate orientation discontinuities. 

The use of a curvature consistency measure was introduced to SFS by Ferrie 
and Lagarde [6]. They use global consistency of principal curvatures [18] to refine 
the surface estimate returned by local shading analysis. Curvature consistency 
is formulated in terms of rotating the local Darboux frame to ensure that the 
principal curvature directions are locally consistent. 

An alternative method of representing curvature information is to use H-K 
labels, but these require us to set 4 thresholds to define the classes in terms 
of the mean and Gaussian curvatures. However, we propose to use curvature 
consistency based upon the shape index of Koenderink and van Doorn [14]. This 
is a continuous measure which encodes the same curvature class information as 
H-K labels in an angular representation, and has the further advantage of not 
requiring any thresholds. 



5.1 The Shape Index 

We reformulate the definition of the shape index in terms of the needle-map. This 
allows us to use the needle-map directly, rather than needing to reconstruct the 
surface. 

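The differential structure of a surface is captured by the Hessian matrix which 
may be approximated in terms of surface normals by 

\[ H = \begin{pmatrix}
\left( \frac{\partial \mathbf{n}}{\partial x} \right)_x & \left( \frac{\partial \mathbf{n}}{\partial x} \right)_y \\
\left( \frac{\partial \mathbf{n}}{\partial y} \right)_x & \left( \frac{\partial \mathbf{n}}{\partial y} \right)_y
\end{pmatrix} \]

where $(\cdots)_x$ and $(\cdots)_y$ denote the x and y components of the parenthesized 
vector respectively. 

The eigenvalues $\kappa_1$ and $\kappa_2$ of the Hessian matrix, found by solving the 
characteristic equation $|H - \kappa I| = 0$, are the principal curvatures of the surface. 
Koenderink and van Doorn [14] used the two eigenvalues to define the shape index 

\[ \phi = \frac{2}{\pi} \arctan\left( \frac{\kappa_2 + \kappa_1}{\kappa_2 - \kappa_1} \right), \qquad \kappa_1 \geq \kappa_2 \qquad (18) \]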

This may be expressed in terms of surface normals thus 




Figure 3 shows the range of shape index values, the type of curvature which they 
represent, and the grey-levels used to display different shape-index values. 

[Figure 3: shape index scale running from -1 to 1, labelled in order CUP, TROUGH, 
RUT, SADDLE RUT, SADDLE, SADDLE RIDGE, RIDGE, DOME, CAP, with the corresponding 
grey-levels running from 1 through 128 to 255.] 

Fig. 3. The shape index scale ranges from -1 to 1 as shown. The shape index values 
are encoded as a continuous range of grey-level values between 1 and 255, with 
grey-level 0 being reserved for background and flat regions (for which the shape 
index is undefined). 
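
To make the scale of Figure 3 concrete, the following sketch computes the shape index from a symmetric 2x2 Hessian estimate and assigns one of the nine labels. Several points are assumptions made for illustration: the function names are hypothetical, the Hessian is taken to be symmetric, and the nine labels are assumed to occupy equal-width intervals centred at multiples of 1/4, which may differ from the intervals used in the original table of classes.

```python
import math

CLASSES = ("cup", "trough", "rut", "saddle rut", "saddle",
           "saddle ridge", "ridge", "dome", "cap")


def principal_curvatures(hxx, hxy, hyy):
    """Eigenvalues (k1 >= k2) of the symmetric matrix [[hxx, hxy], [hxy, hyy]]."""
    mean = 0.5 * (hxx + hyy)
    radius = math.hypot(0.5 * (hxx - hyy), hxy)
    return mean + radius, mean - radius


def shape_index(k1, k2):
    """(2/pi) * arctan((k2 + k1) / (k2 - k1)) for k1 >= k2; None on flat regions."""
    if k1 == 0.0 and k2 == 0.0:
        return None  # the shape index is undefined for a planar patch
    # atan2 reproduces the arctan of the ratio and also handles k1 == k2.
    return (2.0 / math.pi) * math.atan2(-(k1 + k2), k1 - k2)


def curvature_class(phi):
    """Nearest of nine equally spaced class centres on [-1, 1] (assumed spacing)."""
    index = int(math.floor((phi + 1.0) * 4.0 + 0.5))
    return CLASSES[max(0, min(8, index))]
```

With the sign convention of Equation 18, two equal positive curvatures give a shape index of -1 (cup), one positive and one zero curvature give -0.5 (rut), and curvatures of equal magnitude and opposite sign give 0 (saddle).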

Table 1 shows the relationship between the shape-index and the mean and 
Gaussian curvature classes. It is important to note that there are adjacency 
constraints applying to the topographic classes. In particular, the cup (C) 
and dome (D) surface types may not appear adjacent to each other on a surface. 
Moreover, elliptic regions on the surface (those for which K is positive) must be 
separated from hyperbolic regions (those for which K is negative) by a parabolic 
line (where K=0). Parabolic lines are effectively zero crossings of the mean or 
Gaussian curvatures. In other words, domes and cups are enclosed by ridge or 
valley-lines. 

Class         Symbol   H   K   Region type   Shape index interval
                               Elliptic
                           0   Parabolic     [-5/8, -3/8)
Saddle-rut    SV       +   -   Hyperbolic

Table 1. Topographic classes. 



5.2 Adaptive Robust Regularizer Using Curvature Consistency 

As stated above, since the shape index is an angular, physical measure, we ex- 
pect it to vary gradually over a smooth surface. For instance, with reference 
to Figure 3, we would not expect the shape index at adjacent pixels to differ 
by more than one curvature class unless they lie on opposite sides of a surface 
discontinuity. Since the over-smoothing effect of the quadratic smoothness con- 
straint stems directly from the indiscriminate averaging of normals lying across 
a discontinuity, we anticipate that weighting according to curvature consistency 
will reduce the problem in a physically-principled manner. 

To meet these goals we use curvature consistency to control the robust 
weighting kernel applied to the variation in the needle-map direction. The idea is a 
simple one. We use the variance of the shape-index in the neighbourhood N to 
control the width $\sigma$ of the robust error-kernel applied to the directional 
derivatives of the needle map. The kernel width determines the level of smoothing 
applied to the surface normals in the neighbourhood. If the variance of the shape 
index is large, i.e. the neighbourhood contains a lot of topographic structure, 
then we choose a small kernel width. This limits the local smoothing and allows 
significant local variation in the local needle-map direction. From a topographic 
viewpoint, we can see the rationale for this choice by considering the behaviour 
of the needle-map and the shape-index at ridges and ravines. For such features, 
the direction of the needle-map changes rapidly in a particular direction. These 
two structures are parabolic lines which lie between elliptic and hyperbolic 
regions. As a result there is a rapid local variation in shape index. Turning 
our attention to the case where the shape-index variance is small, then the kernel 
width is large. This is the case when we are situated in a hyperbolic or elliptic 
region. Here the shape-index is locally uniform and the needle-map direction 
varies slowly. 

Once again, we use the robust error-kernel of Equation 12 to model needle-map 
smoothness. However, instead of using a fixed kernel of width $\sigma$, we adapt 
the width. The variance dependence of the kernel is controlled using the 
exponential function 

\[ \sigma = \sigma_0 \exp\left( - \left[ \frac{1}{N} \sum_{i \in N} \frac{(\phi_i - \phi_c)^2}{\Delta\phi_d^2} \right]^{1/2} \right) \qquad (20) \]

Here $\phi_c$ is the shape index associated with the central normal of the 
neighbourhood, $\phi_i$ is one of the neighbouring shape-index values and $\Delta\phi_d$ is the 
difference in shape index between the centre values of adjacent curvature classes 
listed in Table 1. The number of neighbourhood normals used in calculating the 
finite difference approximations to $\partial\mathbf{n}/\partial x$ and $\partial\mathbf{n}/\partial y$ is denoted N and $\sigma_0$ is a 
reference kernel width which we set to unity. Using the scale of Figure 3, $\Delta\phi_d = 1/4$. 
To summarise, if the shape index varies significantly over the neighbourhood, a 
small value of $\sigma$ results, and the robust regularizer saturates to produce a heavy 
smoothing effect. In contrast, when the shape index values are already similar, 
the kernel is widened so that little smoothing occurs. When this model is used, 
the needle-map update equation is identical to that of Equation 13. However, 
now the error kernels adapt locally in line with the shape-index variance. 
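
A small numerical sketch shows how the kernel width responds to the shape-index variation. The exact functional form is an assumption here: the width is taken to be $\sigma_0$ times the exponential of minus the root-mean-square shape-index difference, measured in units of $\Delta\phi_d$. The function name `adaptive_width` and its default arguments are hypothetical.

```python
import math


def adaptive_width(phi_c, phi_neighbours, sigma0=1.0, delta_phi=0.25):
    """Kernel width from local shape-index variation (assumed exponential form).

    phi_c          -- shape index at the centre of the neighbourhood
    phi_neighbours -- shape-index values of the neighbouring pixels
    sigma0         -- reference kernel width
    delta_phi      -- spacing between the centres of adjacent curvature classes
    """
    count = len(phi_neighbours)
    mean_sq = sum(((p - phi_c) / delta_phi) ** 2 for p in phi_neighbours) / count
    return sigma0 * math.exp(-math.sqrt(mean_sq))
```

A locally uniform shape index leaves the width at $\sigma_0$; a variation of one class spacing reduces it by a factor of e, and larger variations shrink it further.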

6 Experiments 

In this section we provide some experimental evaluation of the new shape-from- 
shading technique. The evaluation focuses on the quality of the shape-index 
information extracted from the intensity images used in our experiments. The 
images used in our study are taken from the Columbia University COIL database. 
We furnish some comparison between the use of curvature consistency 
constraints and the simple needle-map smoothness constraint. 

We commence in Figure 4 by showing a sequence which shows the shape 
index evolving with iteration number. In each case, the left-hand column shows 
the results obtained using the curvature consistency constraint developed in 
the paper, whilst the right-hand column shows the results of using a quadratic 
smoothness constraint within the same framework. 

Using the curvature consistency scheme, the elliptic and hyperbolic region 
classes become more connected while the parabolic lines (i.e. ridges and ravines) 
become thinner and more continuous. In contrast, using the smoothness con- 
straint leads to loss of the ridge and ravine structure. The regions are noisy and 
exhibit poor connectivity. 

The final shape-index images contain some features which merit special men- 
tion. In particular the ravines defining the boundaries of the wing of the duck 
are well segmented. In addition, the slot in the top of the piggy-bank is correctly 
identified as ravine. 

Figure 5 demonstrates that the curvature classes recovered using 
shape-from-shading are stable to viewpoint changes. Here we show six views of a 
toy-duck. Notice how the valley lines around the beak and the wing are well 
recovered at each viewing angle. Also notice how the shape of the saddle structure 
below the wing is maintained. 

Finally, we provide an initial comparison of the schemes using a simple syn- 
thetic image. The object used in this study is a hemisphere on a plane. The ideal 
shape index histogram for the object would consist of a small peak correspond- 
ing to the boundary where the hemisphere meets the plane at -0.5 (rut), and a 
large peak at 1.0 (dome). Figure 6 shows the shape index histograms recovered 
using curvature consistency (left) and needle-map smoothness (right). 



7 Conclusions 

This paper has presented a new shape-from-shading algorithm which uses 
curvature consistency to extract topographic information from 2D intensity images. 
The approach has two novel ingredients. Firstly, we provide a geometric 
framework for needle-map recovery. This process iterates between two steps. In the 
first step, we modify the local surface normal direction to satisfy local 
consistency constraints. In the second, we back-project the updated normals onto a 
cone of shading ambiguity so as to satisfy the image irradiance equation as a hard 
constraint. The second novel contribution resides in the modelling of curvature 
consistency. Here we use the Koenderink and van Doorn shape-index as a measure of 
surface topography. We use the variance of the shape-index to control the width of 
a robust error kernel which controls needle-map smoothness errors. In this way we 
ensure that the local smoothness of the needle-map responds to the variability 
of the local surface topography. We illustrate the comparative advantages of the 
new method on real-world imagery from the COIL object database. Here it 
proves to be reliable in delivering both smooth elliptic and hyperbolic regions 
together with thin and continuous parabolic lines. 

Our future plans revolve around exploiting the topographic information 
delivered by the new shape-from-shading scheme for 3D object recognition from 
2D imagery. 

Fig. 4. Evolution of Shape Index classes with iterations of the SFS schemes. Top 
row: smoothness constraint. Second row: curvature consistency constraint. Third 
row: smoothness constraint. Bottom row: curvature consistency constraint. 

Abstract. A new representation called Harmonic Shape Images for 3D 
free-form surfaces is proposed in this paper. This representation is 
based on the well-studied theory of Harmonic Maps which studies the 
mapping between different metric manifolds from the energy- 
minimization point of view. The basic idea of Harmonic Shape Images 
is to map a 3D surface patch with disc topology to a 2D domain and 
encode the shape information of the surface patch into the 2D image. 
Due to the application of harmonic maps in generating Harmonic Shape 
Images, Harmonic Shape Images have the following advantages: they 
preserve both the shape and the continuity of the underlying surfaces, 
they are robust to occlusion and they are independent of surface 
sampling strategy. The proposed representation is applied to solve the 
surface-registration problem. Experiments have been conducted on real 
data and results are presented in the paper. 



1 Introduction 

Shape representation is a fundamental issue in computer vision. In the past few 
decades, the objects under study have evolved from 2D objects to 3D polygonal 
objects, parametric surfaces and free-form surfaces. In this paper, the objects that are 
studied are 3D free-form surfaces represented by polygonal meshes. 

A large amount of research has been done on surface registration ([17]-[21]) and 
surface representation ([5]-[16]). According to the way in which objects are 
represented, existing representations of 3D free-form surfaces can be regarded as 
either global or local. Examples of global representations include algebraic 
polynomials [15], spherical representations such as EGI [5], SAI [6][7] and 
COSMOS [11], triangles and crease angle histograms [12], and HOT curves [13]. 
Although global representations can describe the overall shape of an object, they 
usually have difficulties handling occlusion and clutter. 

Among the local surface representations proposed so far, the one that uses local 
shape signature [16] is among those that have been quite successful in real 
applications. This representation is independent of surface topology and easy to 
compute. The matching performance using this representation decreases gracefully as 
the occlusion and clutter increase. However, the major limitation of this 
representation is that it provides only sets of individual point correspondences that 
have to be grouped into sets of mutually consistent correspondences. This limitation 
stems from the fact that the local signatures only partially capture the shape of the 
surface. 

In this paper, as part of our continuing effort to develop data-level 
representations for surface matching, we investigate the surface-matching problem 
using a mathematical tool called harmonic maps with the goal of addressing the 
limitation of the representation in [16]. The harmonic map theory studies mappings 
between different metric manifolds from an energy-minimization point of view. With 
the application of harmonic maps, a surface representation called Harmonic Shape 
Images is created and used for surface matching. Furthermore, owing to the properties 
of harmonic maps, Harmonic Shape Images are able to provide all the point 
correspondences once two regions are matched. This will be shown in detail in the 
paper. 

The basic idea of Harmonic Shape Images is to map a 3D surface patch with disc 
topology to a 2D domain and to encode the shape information of the surface patch 
into the 2D image. This simplifies the surface-matching problem to a 2D image- 
matching problem. When constructing Harmonic Shape Images, harmonic maps 
provide a mathematical solution to the mapping problem between a 3D surface patch 
with disc topology and a 2D domain. 

This paper is organized as follows: the concept of Harmonic Shape Images is 
explained in Section 2; in Section 3, the generation of Harmonic Shape Images is 
discussed in detail; the properties of Harmonic Shape Images are discussed in Section 
4; in Section 5, as an application example, Harmonic Shape Images are applied to 
solve the surface-matching problem. Conclusions and future work are presented at the 
end of the paper. 



2 The Concept of Harmonic Shape Images 

The 3D free-form surfaces studied in this paper are represented by polygonal meshes. 
According to [18], a free-form surface S is defined to be a smooth surface such that 
the surface normal is well defined and continuous almost everywhere, except at 
vertices, edges and cusps. Comparing two such meshes directly is difficult for the 
following reasons: the topology may be different for different objects; the sampling 
may be different even for the same object; the surfaces may not be complete because 
of occlusion and clutter in the scene. 

In order to address the above difficulties, it is advantageous for the shape 
representation to have the following characteristics: it is local; it is defined on a 
simple domain; and it fully captures the shape of the original surface. Local 
representations have the power to deal with occlusion and clutter. This has already 
been demonstrated by some of the existing representations [16]. Defining the shape 
representation on a simple domain, e.g., a 2D image, facilitates the comparison of the 
shape representations. In other words, the problem of surface matching in 3D can be 
reduced to the problem of image matching in 2D. If the shape representation fully 
captures the shape of the original surface, e.g., curvature and continuity, then 
correspondences between the two surfaces can be established once their shape 
representations match. 

The development of Harmonic Shape Images was motivated by the above 
requirements. Given a 3D surface S as shown in Fig. 1(a), let v denote an arbitrary 
vertex in S. Let D(v, R) denote the surface patch which has the central vertex v and 
radius R and has disc topology. D(v, R) is connected and consists of all the vertices in 
S whose surface distance from v is less than, or equal to, R. The overlaid region in 
Fig. 1(a) is an example of D(v, R). Its enlarged version is shown in Fig. 1(b). If the 
unit disc is selected to be the 2D domain and D(v, R) is mapped onto the domain using 
a certain strategy, then the resultant image HI(D(v, R)) is called the harmonic image, 
as shown in Fig. 1(c). The harmonic image preserves the shape and continuity of 
D(v, R). This will be explained further in the next section. Because correspondences 
are established between the vertices in D(v, R) and the vertices in HI(D(v, R)), the 
Harmonic Shape Image of D(v, R), HSI(D(v, R)), can be obtained by associating shape 
attributes, e.g., curvature, at each vertex of D(v, R) with the corresponding vertex in 
HI(D(v, R)). Fig. 1(d) shows the Harmonic Shape Image of the surface patch in 
Fig. 1(b). The curvature values are gray-scale coded. 
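
The selection of the vertices of D(v, R) can be sketched as a bounded shortest-path search over the mesh. This is only an illustration under stated assumptions: the surface distance is approximated by the shortest path along mesh edges, the mesh is given as a hypothetical adjacency dictionary, and the sketch does not check that the resulting patch has disc topology.

```python
import heapq


def surface_patch(adjacency, v, R):
    """Vertices whose edge-path distance from v is at most R.

    adjacency -- dict mapping a vertex to a list of (neighbour, edge_length) pairs
    Returns a dict mapping each vertex of the patch to its distance from v.
    """
    dist = {v: 0.0}
    heap = [(0.0, v)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for w, length in adjacency[u]:
            nd = d + length
            # Only grow the patch through vertices that stay within radius R.
            if nd <= R and nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist
```

Because the search only expands through vertices that are themselves within the radius, the returned set is connected, which matches the requirement placed on D(v, R).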







Fig. 1. (a) A surface patch (overlaid region) D(v, R) on a given surface; (b) Enlarged version 
of D(v, R) in (a); (c) The harmonic image of D(v, R); (d) The Harmonic Shape Image of D(v, R). 



Harmonic Shape Images can be generated for any vertex on a given surface as long 
as there exists a valid surface patch at that vertex. Here, a valid surface patch means a 
connected surface patch with disc topology. The generation of Harmonic Shape 
Images will be discussed in detail in the next section. 



3 The Generation of Harmonic Shape Images Based on Energy 
Minimization 

Given a surface patch D(v, R) and a selected 2D domain M, the generation of its 
Harmonic Shape Image can be summarized as two steps: in the first step, a mapping 
between D(v, R) and M is constructed, which results in the harmonic image 
HI(D(v, R)); in the second step, shape attributes are stored at each vertex of the 
harmonic image, which results in the Harmonic Shape Image HSI(D(v, R)). In 
comparison with the second step, the first step is much more difficult and 
complicated. The mathematical tool of harmonic maps is employed to construct the 
mapping between D(v, R) and M. In this section, we first briefly introduce harmonic 
maps and then discuss in detail how to use harmonic maps to generate the harmonic 
image HI(D(v, R)). Finally, we discuss how to obtain the Harmonic Shape Image 
HSI(D(v, R)) from the harmonic image HI(D(v, R)). 



3.1 Harmonic Maps 

According to [2], the concept of harmonic maps is closely related to the concept of 
geodesics. Geodesics are the shortest connection between two points in a metric 
continuum, e.g., a Riemannian manifold. Geodesics are critical points of the 
following length integral 

\[ E(c) = \frac{1}{2} \int_0^1 \left\| \frac{\partial c}{\partial t} \right\|^2 dt \qquad (1) \]

where $c : [0,1] \to N$ is the parameterization of the curve proportional to arc length. 
The generalization of the energy integral in (1) for maps between Riemannian 
manifolds leads to the concept of harmonic maps. Harmonic maps are critical points 
of the corresponding integral where energy density is defined in terms intrinsic to the 
geometry of the source and target manifolds and the map between them. 
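
The role of the parameterization in the energy integral can be illustrated on a sampled curve. The sketch below assumes the energy $E(c) = \frac{1}{2}\int_0^1 \|dc/dt\|^2\,dt$ and approximates it for a polyline whose vertices are sampled at equal steps of t; the function name `curve_energy` is hypothetical.

```python
def curve_energy(points):
    """Discrete estimate of 1/2 * integral of |dc/dt|^2 dt over t in [0, 1].

    points -- curve samples c(t_k) taken at equally spaced parameter values t_k
    """
    segments = len(points) - 1
    dt = 1.0 / segments
    total = 0.0
    for p, q in zip(points, points[1:]):
        squared_step = sum((b - a) ** 2 for a, b in zip(p, q))
        # |dc/dt|^2 * dt is approximated by |c(t+dt) - c(t)|^2 / dt on each segment.
        total += 0.5 * squared_step / dt
    return total
```

For a unit-length straight segment sampled proportionally to arc length this gives 0.5, that is, half the squared length. Both a detour between the same end points and an uneven sampling of the same straight segment give larger values, which is why the minimizer is the shortest curve traversed at constant speed.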

Formally, harmonic maps are defined as follows [1][2]. Let (M, g) and (N, h) be 
two smooth manifolds of dimensions m and n, respectively, and let 
φ : (M, g) → (N, h) be a smooth map. Let (x^i), i = 1, ..., m, and (y^α), α = 1, ..., n, be 
local coordinates around x and φ(x), respectively. Take (x^i) and (y^α) of M and N at 
corresponding points under the map φ, whose tangent vectors of the coordinate curves 
are ∂/∂x^i and ∂/∂y^α, respectively. Then the energy density of φ is defined as 

\[ e(\varphi) = \frac{1}{2} \sum_{i,j=1}^{m} \sum_{\alpha,\beta=1}^{n} g^{ij} \, h_{\alpha\beta} \, \frac{\partial \varphi^{\alpha}}{\partial x^{i}} \frac{\partial \varphi^{\beta}}{\partial x^{j}} \qquad (2) \]

in which g^{ij} (the components of the inverse of (g_ij)) and h_αβ are the components 
of the metric tensors in the local coordinates on M and N. The energy of φ in local 
coordinates is given by the number 

\[ E(\varphi) = \int_{M} e(\varphi) \, dv_{M} \qquad (3) \]

If φ is of class C^2, E(φ) < +∞, and φ is an extremum of the energy, then φ is called a 
harmonic map and satisfies the corresponding Euler-Lagrange equation. 

The above is a general definition of harmonic maps. Now let us look at a special 
case in which both the source manifold and the target manifold are surfaces in 3D 
Euclidean space. To be more specific, let D be a surface of disc topology and P be a 
planar region. According to results in the theory of harmonic maps [2], the 
following problem has a unique solution: given a homeomorphism b between the 
boundary of D and the boundary of P, there exists a unique harmonic map φ : D → P 
that agrees with b on the boundary of D and minimizes the energy functional. 

Furthermore, the harmonic map φ has the following properties [2]: it always exists; 
it is unique and continuous; it is one-to-one and onto; and it is intrinsic to D and P. 
These properties show that the harmonic map φ is a well-behaved mapping. 

Recall that the mapping to be constructed is from the surface patch D(v, R) to the 2D 
domain M; harmonic maps therefore provide a good solution to this problem. 



3.2 Approximation of Harmonic Maps 

As we have already seen in the previous section, harmonic maps are solutions of 
partial differential equations. Due to the expensive computational cost in solving 
partial differential equations and the discrete nature of surfaces we deal with in 
practice, it is natural to look for an approximation of harmonic maps when mapping 
D(v, R) to M. 

Eck et al. proposed an approximation approach to harmonic maps in [4]. Eck's 
approximation consists of two steps. At the first step, the boundary of the 3D surface 
patch is mapped onto the boundary of an equilateral triangle that is selected to be the 
2D target domain. At the second step, the interior of the surface patch is mapped onto 
the interior of the equilateral triangle with the boundary mapping as a constraint. Our 
approach uses the same interior mapping strategy as Eck's approach but a 
different target domain and a different boundary mapping strategy. In this section, we 
will first discuss the interior mapping with a given boundary mapping and then 
discuss our boundary mapping in detail. 



3.2.1 Interior Mapping 

Let D be a 3D surface patch with disc topology and P be a unit disc in 2D. We use 
D(v, R) to denote that the central vertex of D is v and the radius of D is R, which is a 
surface distance. Let ∂D and ∂P be the boundaries of D and P, respectively. Let v_i, 
i = 1, ..., n, be the interior vertices of D. The interior mapping φ maps v_i, i = 1, ..., n, onto 
the interior of the unit disc P with a given boundary mapping b : ∂D → ∂P. φ is 
obtained by minimizing the following energy functional: 

\[ E(\varphi) = \frac{1}{2} \sum_{\{i,j\} \in Edges(D)} k_{ij} \, \| \varphi(i) - \varphi(j) \|^{2} \qquad (4) \]

In (4), for simplicity of notation, we use φ(i) and φ(j) to denote φ(v_i) and φ(v_j), 
which are the images of the vertices v_i and v_j on P under the mapping φ. The values 
of φ(i) define the mapping φ. The k_ij serve as spring constants with the definition in (5) 

\[ k_{ij} = \operatorname{ctg} \theta(e_{mi}, e_{mj}) + \operatorname{ctg} \theta(e_{li}, e_{lj}) \qquad (5) \]

in which θ(e_mi, e_mj) and θ(e_li, e_lj) are defined in Fig. 2. If e_ij is associated with only 
one triangle, then there will be only one term on the right of (5). 

Fig. 2. Definition of spring constants. 



An instance of the functional E(φ) can be interpreted as the energy of a spring 
system by associating each edge in D with a spring. Then the mapping problem from 
D to P can be viewed as adjusting the lengths of those springs when squeezing them 
down onto P. If the energy of D is zero, then the energy increases when the springs 
are squeezed down to P because all the springs are compressed. Different ways of 
adjusting the spring lengths correspond to different mappings φ. The best φ minimizes 
the energy functional E(φ). With the definition of the spring constants in (5), the best 
φ best preserves the ratios of edge lengths in D and, therefore, the shape of D, under the 
boundary mapping b. 
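
As an illustration of (5), the following minimal Python sketch accumulates the cotangent spring constants over a triangle mesh. The function name `cotangent_spring_constants` and the input layout (a list of 3D vertex coordinates and a list of vertex-index triples) are assumptions made for this example and are not taken from the original implementation.

```python
import numpy as np
from collections import defaultdict

def cotangent_spring_constants(vertices, triangles):
    """Return {(i, j): k_ij} with i < j for every edge of a triangle mesh.

    vertices  : sequence of 3D points
    triangles : sequence of (a, b, c) vertex-index triples
    """
    vertices = np.asarray(vertices, dtype=float)
    k = defaultdict(float)
    for tri in triangles:
        for a in range(3):
            # In this triangle, edge (i, j) lies opposite vertex m.
            m, i, j = tri[a], tri[(a + 1) % 3], tri[(a + 2) % 3]
            u = vertices[i] - vertices[m]
            w = vertices[j] - vertices[m]
            # cot of the angle at m between edges e_mi and e_mj
            cot = np.dot(u, w) / np.linalg.norm(np.cross(u, w))
            k[(min(i, j), max(i, j))] += cot
    return dict(k)
```

An edge shared by two triangles receives two cotangent terms, while an edge that belongs to a single triangle receives only one, as stated after (5).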

The minimum of the energy functional E(φ) can be found by solving a sparse linear 
least-squares system of (4) for the values φ(i). This results in a set of linear equations 
that can be written in matrix form 

\[ A_{n \times n} X_{n \times 2} = b_{n \times 2} \qquad (6) \]

In (6), X_{n×2} = [φ(1), ..., φ(n)]^T denotes the unknown 
coordinates of the interior vertices of D when mapped onto P under φ. A_{n×n} is a 
sparse matrix whose structure is as follows. 
If D is considered to be an undirected graph, then A_{n×n} can be interpreted as an 
adjacency matrix of the graph. All the diagonal entries of A_{n×n} are non-zero. For an 
arbitrary row i in A_{n×n}, if vertex v_i is connected to vertices v_m and v_j, then only the 
mth and jth entries in row i are non-zero. Similarly, the ith entries in row m and row j 
are also non-zero. Therefore, in addition to the usual properties that matrix A_{n×n} has 
in least-squares problems, A_{n×n} is sparse in this particular case. The boundary 
condition is accommodated in the matrix b_{n×2}. If vertex v_i is connected to 
boundary vertices, then its corresponding entry i in b_{n×2} is weighted by the 
coordinates of those boundary vertices. Otherwise, the entries in b_{n×2} are zero. 

After φ is solved from (6), the vertices in D are mapped onto the domain P. The 2D 
image of D after being mapped onto P is named the harmonic image HS(D) of D. 
Examples of D and HS(D) are shown in Fig. 3(a) and (b), respectively. 
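
The assembly and solution of (6) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `solve_interior_mapping`, the dictionary-based inputs, and the use of a dense solver are choices made for this example; a practical implementation would exploit the sparsity of the matrix.

```python
import numpy as np

def solve_interior_mapping(n_vertices, springs, boundary_pos):
    """Minimize (4) with the boundary vertices held fixed.

    springs      : {(i, j): k_ij} for every mesh edge
    boundary_pos : {vertex index: (x, y)} given by the boundary mapping b
    Returns an (n_vertices, 2) array with the image of every vertex on P.
    """
    interior = [v for v in range(n_vertices) if v not in boundary_pos]
    row = {v: r for r, v in enumerate(interior)}
    A = np.zeros((len(interior), len(interior)))
    b = np.zeros((len(interior), 2))
    for (i, j), k in springs.items():
        for p, q in ((i, j), (j, i)):
            if p not in row:
                continue                      # p is a boundary vertex
            A[row[p], row[p]] += k            # diagonal entry
            if q in row:
                A[row[p], row[q]] -= k        # interior neighbour
            else:
                # boundary neighbour: weighted coordinates go to the right side
                b[row[p]] += k * np.asarray(boundary_pos[q], dtype=float)
    X = np.linalg.solve(A, b)
    phi = np.zeros((n_vertices, 2))
    for v, pos in boundary_pos.items():
        phi[v] = pos
    for v, r in row.items():
        phi[v] = X[r]
    return phi
```

Each interior vertex ends up at the spring-weighted average of its neighbours, which is the stationarity condition of (4).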




Fig. 3. (a) An example of a surface patch; (b) its harmonic image; (c) its Harmonic Shape 
Image. (c) is the same as (b) except that it has a shape attribute gray-scale coded at each vertex. 
It will be introduced later in the paper. 



3.2.2 Boundary Mapping 

The construction of the boundary mapping b : ∂D → ∂P is illustrated in Fig. 4. 





Fig. 4. Illustration of the boundary mapping between the surface patch and the 2D domain. 

First of all, let us define the vertices and vectors in Fig. 4. O is the central vertex of 
D and O' is the center of P. v_i, i = 1, ..., 5, are the boundary vertices of D. D is said to 
have radius R when the surface distance from any vertex in D to the central vertex O 
is less than, or equal to, R. For some boundary vertices, e.g., v_i^r, i = 1, ..., 4, the surface 
distance between any of them and the central vertex O is equal to R; for other 
boundary vertices, e.g., v_1^o, the surface distance is less than R. The vertices in the 
former case are called radius boundary vertices and the vertices in the latter case are 
called occluded boundary vertices. Radius boundary vertices are determined by the 
size of the surface patch, while occluded boundary vertices are determined by 
occlusion (either self-occlusion or occlusion by other objects). The vector from the 
central vertex O to a radius boundary vertex v_i^r is called a radius vector, while the vector from O 
to an occluded boundary vertex v_j^o is called an occlusion vector. 

Now let us define the angles in Fig. 4. Angles a_i, i = 1, ..., 4, are the angles between 
two adjacent radius vectors. Angles b_j, j = 1, 2, are the angles 
between two adjacent occlusion vectors, or one occlusion vector and one adjacent 
radius vector, in an occlusion range. An occlusion range is a consecutive sequence of 
boundary vertices in which all but the first and last are occluded boundary vertices. 
For example, (v_4^r, v_1^o, v_1^r) is an occlusion range. The sum of b_j over an occlusion 
range is the angle a_i formed by the first and last radius vectors in that range. For 
example, the sum of b_j over (v_4^r, v_1^o, v_1^r) is a_4. 

The construction of the boundary mapping consists of two steps. At the first step, 
the radius boundary vertices are mapped onto the boundary of the unit disc P, which 
is a unit circle. In Fig. 4, v_i^r, i = 1, ..., 4, are mapped to v_i^r', i = 1, ..., 4, respectively. It can 
be seen that once the angles θ_i are determined, the positions of v_i^r' are determined. θ_i 
is computed as follows: 

\[ \theta_i = \frac{a_i}{\sum_{k=1}^{n} a_k} \, 2\pi \qquad (7) \]

At the second step, the occluded boundary vertices in each occlusion range are 
mapped onto the interior of the unit disc P. For example, in Fig. 4, v_1^o, which is in the 
occlusion range (v_4^r, v_1^o, v_1^r), is mapped onto v_1^o'. Once the angles ψ_j and the radii r_j 
are determined, the position of v_1^o' is determined. The ψ_j are computed as follows: 

\[ \psi_j = \frac{b_j}{\sum_{m=1}^{n} b_m} \, \theta_i \qquad (8) \]

in which n is the number of angles within the occlusion range and θ_i is the angle 
corresponding to the occlusion range. r_j is defined to be 

\[ r_j = \frac{dist(v_j^o, O)}{R} \qquad (9) \]

in which dist(v_j^o, O) is the surface distance between the occluded boundary vertex v_j^o 
and the central vertex O, and R is the radius of the surface patch D. 
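
The two steps of the boundary mapping, i.e. (7) for radius boundary vertices and (8) and (9) for occluded boundary vertices, can be sketched as below. The function name `map_boundary` and the input encoding (one tuple per ordered boundary vertex) are hypothetical and chosen only for this illustration; the sketch also assumes that the ordered list starts at a radius boundary vertex.

```python
import math

def map_boundary(boundary, R):
    """Map ordered boundary vertices of a patch of radius R into the unit disc.

    boundary : list of (is_radius, angle_to_next, dist_to_center), where
               angle_to_next is the angle (radians) at O between the vector
               to this vertex and the vector to the next boundary vertex
               (cyclic order), and dist_to_center is the surface distance to O.
    Returns a list of (x, y) positions, one per boundary vertex.
    """
    n = len(boundary)
    total = sum(angle for _, angle, _ in boundary)        # sum of all a_k
    positions = [None] * n
    polar = 0.0                                           # polar angle of range start
    start = 0
    while start < n:
        # A range runs from one radius vertex up to, not including, the next.
        end = start + 1
        while end < n and not boundary[end][0]:
            end += 1
        b = [boundary[t][1] for t in range(start, end)]   # angles b_j in the range
        a_i = sum(b)                                      # angle between radius vectors
        theta_i = a_i / total * 2.0 * math.pi             # Eq. (7)
        positions[start] = (math.cos(polar), math.sin(polar))
        offset = 0.0
        for t in range(start + 1, end):
            offset += b[t - start - 1] / a_i * theta_i    # Eq. (8): accumulate psi_j
            r = boundary[t][2] / R                        # Eq. (9)
            positions[t] = (r * math.cos(polar + offset),
                            r * math.sin(polar + offset))
        polar += theta_i
        start = end
    return positions
```

Radius boundary vertices land on the unit circle, while each occluded boundary vertex lands inside the disc at a radius proportional to its surface distance from O.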

Two issues need to be mentioned regarding the above boundary mapping. The first 
one is that the boundary vertices of D need to be ordered in either a clockwise or 
counter-clockwise manner before constructing the boundary mapping. Similarly, 
when mapping vertices onto the boundary of P, either clockwise or counter-clockwise 
order needs to be determined. The two orders must remain consistent for all surface 
patches for the convenience of matching. 

The second issue is how to select the starting vertex among the boundary vertices 
of D. If the starting vertex is always mapped to the same vertex on the boundary of 
the unit disc, then different starting vertices will result in different boundary 
mappings. This will, in turn, result in different interior mappings. For example, the 
harmonic image shown in Fig. 3(b) has the starting vertex indicated by the black 
arrow, while in Fig. 5(b), the harmonic image of the same surface patch is different 
due to a different starting vertex. 




Fig. 5. The harmonic image (b) and Harmonic Shape Image (c) of the same surface patch (a) 
shown in Fig. 3(a) with a different starting vertex for boundary mapping. 

In fact, the harmonic images with different starting vertices differ only by a 
planar rotation. The reason for this is that neither the angles θ_i (in (7)) nor the angles 
ψ_j (in (8)) will change with respect to different starting vertices. Nor will the radius r_j 
in (9) change. Therefore, the starting vertex can be selected randomly. The rotation 
difference will be found later by the matching process that is discussed in [23]. 



3.2.3 The Generation of Harmonic Shape Images 

In Section 3.2.2, we have shown that, given a surface patch D, its harmonic image 
HS(D) can be created using harmonic maps. There is one-to-one correspondence 
between the vertices in D and the vertices in HS(D). Harmonic shape images, HSI(D), 
are generated by associating a shape attribute at each vertex of HS(D). In our current 
implementation, an approximation of the curvature at each vertex is used to generate 
Harmonic Shape Images. For details about the curvature approximation, please refer 
to [6]. 

Fig. 3(c) and Fig. 5(c) are examples of Harmonic Shape Images in which high 
intensity values represent high curvature values and low intensity values represent 
low curvature values. 

Similar to Harmonic Shape Images, more images can be generated by associating 
other properties, e.g., color and texture, with each vertex in harmonic images. This 
shows that harmonic maps provide a general framework for storing surface-related 
information. 



4 Properties of Harmonic Shape Images 

Due to the way that Harmonic Shape Images are generated and the application of 
harmonic maps in the generation, Harmonic Shape Images have many good 
properties. In this section, we will discuss these properties and the advantages that 
benefit surface matching. 

Local Representation. Harmonic Shape Images are created for surface patches on 
a given surface. In fact, Harmonic Shape Images can be generated for every vertex on 
the surface as long as there exists a valid patch at that vertex. For an arbitrary surface 
patch D(v, R), the radius R controls the size of the patch. When R is small, the 
Harmonic Shape Image of D(v, R) represents only the shape of the neighborhood 
around v. This shows that Harmonic Shape Images are local representations which do 
not depend on the overall topology of the underlying object. 

Parameterization from D to P. At the first step of generating Harmonic Shape 
Images, a mapping needs to be created between the surface patch D and the 2D 
domain P. This mapping actually constructs a parameterization of D in P. Because the 
mapping is one-to-one and onto, correspondences between D and P can be 
established. This means that once two Harmonic Shape Images match, the 
correspondences between the two surface patches can be obtained immediately. In 
addition to having the properties of one-to-one and onto, this parameterization allows 
us to resample the original surface patch, if necessary. In practice, resampling such as 
raster scanning makes it easy to compare two Harmonic Shape Images. 

Intrinsic to the Shape of the Underlying Surface. Different mappings between 
the surface patch D and the 2D domain P can be established. By using harmonic 
maps, the mapping constructed in this paper is intrinsic to the shape of the underlying 
surface patch D. In other words, this mapping is invariant to the position of D in 3D 
space. Furthermore, the approximation of harmonic maps used in our current 
implementation does not depend on a specific sampling strategy, e.g., uniform 
sampling. It can be applied to any triangular mesh. As long as the sampling rate is 
high enough, the mapping between D and P can be approximated with good accuracy. 
Harmonic Shape Images are obtained by associating curvature attributes with each 
vertex of the harmonic images. Therefore, Harmonic Shape Images are invariant to 
rotation and translation, and are determined only by the shapes of the underlying 
surfaces. 

Uniqueness. Following the above property, it can be seen that different surfaces of 
different shapes have different Harmonic Shape Images. No two surfaces with 
different shapes have the same Harmonic Shape Image. This means that Harmonic 
Shape Images are unique. This is an important property when applied to object 
recognition. 

Existence. The existence of harmonic maps from a surface with disc topology to a 
planar domain ensures the existence of Harmonic Shape Image for a given surface 
patch D. 

Robustness. One important issue for surface representations is their robustness 
with respect to occlusion. In this section, we first explain why Harmonic Shape 
Images are robust to occlusion and then, using real data, demonstrate the robustness 
under occlusion. 

The reason for the robustness of Harmonic Shape Images with respect to occlusion 
lies in the way in which the boundary mapping is constructed. Recall that in Section 
3.2.2 the boundary vertices are classified into radius boundary vertices and occluded 
boundary vertices. It is the radius boundary vertices that determine the angles θ_i, 
which then determine the overall boundary mapping. The effect of occlusion is 
limited to the occlusion range; therefore, it does not propagate much outside of 
the occlusion range. This means that, as long as there are enough radius boundary 
vertices present in the surface patch, the overall harmonic image will remain 
approximately the same in spite of occlusion. 

The following experiment was conducted to demonstrate the robustness of 
Harmonic Shape Images under occlusion. Let us use the surface patches D1 and D2 in 
Fig. 6(1) and (2) as an example to illustrate the experiment. D2 is the same as D1 
except for the occluded region. Their Harmonic Shape Images are shown in Fig. 6(5) 
and (6), respectively. It can be seen that the Harmonic Shape Images of D1 and D2 are 
similar in spite of the occlusion on D2. This means that the occluded part of D2 does 
not much affect the shape representation for the non-occluded regions, thus making 
it possible to match D2 to D1 by matching the Harmonic Shape Images of their 
non-occluded regions. The normalized correlation coefficient of HSI(D1) and HSI(D2) is 
0.9878, which verifies that D2 can still be matched correctly to D1 with occlusion. 
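
The text does not spell out the formula for the normalized correlation coefficient; the sketch below assumes the standard zero-mean definition, applied to two resampled Harmonic Shape Images stored as arrays of equal shape. The function name `normalized_correlation` and the optional `mask` argument (used here to restrict the comparison to non-occluded pixels) are assumptions made for this example.

```python
import numpy as np

def normalized_correlation(img1, img2, mask=None):
    """Zero-mean normalized correlation of two equally sized images.

    mask : optional boolean array; only pixels where mask is True are compared.
    """
    a = np.asarray(img1, dtype=float).ravel()
    b = np.asarray(img2, dtype=float).ravel()
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        a, b = a[keep], b[keep]
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))
```

The value lies in [-1, 1] and equals 1 when one image is a positive affine function of the other.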




Fig. 6. A surface patch and its variations with different parts cut off to simulate occlusions. 
Panels (1)-(4) show the surface patches and panels (5)-(8) their Harmonic Shape Images. 

A sequence of surface patches with different occluded parts of D1 is matched to 
D1 to further test the robustness of Harmonic Shape Images under occlusion. Fig. 6(3) 
and (4) are two examples in the sequence. Their Harmonic Shape Images are shown 
in Fig. 6(7) and (8), respectively. Fig. 7 shows the matching result. 

It can be seen from Fig. 7 that as the percentage of occlusion boundary increases, 
the occluded mesh area also increases. However, the normalized correlation 
coefficient remains stable in the range of [0.8, 1.0]. Fig. 6(8) shows the Harmonic 
Shape Image of the surface patch in Fig. 6(4). In spite of severe occlusion, the 
Harmonic Shape Image for the non-occluded regions is still similar to that in Fig. 6(5). 




Fig. 7. Matching results under occlusion. The original surface is shown in Fig. 6(1). Some of 
the surfaces with occlusion are shown in Fig. 6(2)-(4). (a) Normalized correlation coefficient; 
(b) Percentage of occluded area. 



5 Surface Registration Using Harmonic Maps 

Given two 3D surfaces S1 and S2 and no prior knowledge of their relative position, the 
problem of surface matching is to find the rigid transformation between S1 and S2 and 
establish point correspondences between S1 and S2. In this section, the 
surface-matching problem is solved using Harmonic Shape Images. Experimental results 
using real data are presented. The algorithm for matching Harmonic Shape Images is 
discussed in detail in [23]. 

Two surfaces S1 and S2, shown in Fig. 8(a) and (b), are to be registered. The surface 
patch in wireframe overlaid on S1 is a selected surface patch D1(v, R) on S1. After 
searching through all the valid surface patches on S2 by comparing Harmonic Shape 
Images, the surface patch that best matches D1(v, R) is found. Fig. 8(c) shows the 
correspondences between the two matched surface patches. The transformation 
between S1 and S2 can be computed using those correspondences. The registered 
surfaces are shown in Fig. 8(d). 
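
The text does not state how the rigid transformation is computed from the correspondences. One common choice, shown here purely as an assumed illustration, is the SVD-based least-squares fit (often called the Kabsch method). The function name `rigid_transform` is hypothetical.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares rotation R and translation t with R @ src[i] + t ~ dst[i].

    src, dst : (N, 3) arrays of corresponding 3D points, N >= 3, not collinear.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centred point sets
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the least-squares solution
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t
```

Given the vertex correspondences between two matched patches, `rigid_transform` returns the rotation and translation that bring the first surface into the coordinate frame of the second.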



6 Conclusion and Future Work 

In this paper, a shape representation called Harmonic Shape Images is proposed for 
3D free-form objects. Owing to the application of harmonic maps, Harmonic Shape 
Images have many good properties, e.g., they preserve both the shape and continuity 
of the underlying surfaces. Preliminary results have shown that Harmonic Shape 
Images are robust to occlusion. The experiment on surface registration has 
demonstrated the application of Harmonic Shape Images in surface matching. 

There are two directions for future work. The first one is to develop a new 
similarity criterion for estimating Harmonic Shape Images. The second one is to apply 
the proposed Harmonic Shape Images to more applications such as object recognition 
and classification. 




Fig. 8. Surface registration using Harmonic Shape Images. (a), (b) Surfaces to be registered; (c) 
some of the correspondences between the best matched surface patches; (d) the registered 
surfaces. The overlaid surface patch in (a) is randomly selected on S1. For the surface patches 
on S2 that are good matches of the overlaid surface patch in (a), their central vertices are marked 
on S2. The best match is indicated by the black arrow. 



Abstract. Size functions are functions from the real plane to the natural 
numbers useful for describing shapes of objects. They allow the problem 
of comparing shapes to be translated into the problem of comparing functions, 
which is a much simpler task. In order to perform the comparison between 
size functions we present a method to measure the energy necessary to 
deform size functions into each other. Minimizing such an energy allows 
for a measure of the similarity between shapes. Some experimental results 
concerning the comparison of free hand-drawn sketches are shown. 



1 Introduction 

Size functions are a mathematical tool originally conceived for representing 
shapes but applicable for studying all data that can be seen as topological spaces. 
Size functions have already been successfully used to deal with various recognition 
problems (see, e.g., [1], [2], [3], [4], [13], [14], [15], [16], [17], [18]). Basically, 
a size function is the output of a transform whose input is a topological space 
together with a real function defined on it. The topological space is the signal 
(e.g. B/W image, colour image, sound wave) under study, and the real function 
defined on it is the criterion used to study the signal. Each size function is a 
natural valued function of two real variables. 

The purpose of this paper is to provide a mathematical method for measuring 
the energy necessary to deform size functions into each other. This gives a measure 
of the similarity of the corresponding topological spaces. In order to do so, 
size functions are preliminarily transformed into simpler objects, precisely into 
particular formal series. This representation contains almost the same amount 
of information about the space under study as the original size function does 
but it is much easier to handle. Then we show how to assign an energy to any 
deformation of a formal series associated with a size function. This way we define 
a deformation energy for size functions. Finally, we measure the extent to which 
two spaces resemble each other by minimizing the energy necessary to deform 
the corresponding size functions into each other. 

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 44-53, 1999. 

© Springer-Verlag Berlin Heidelberg 1999 

This framework for the comparison of size functions is treated mainly from 
a theoretical point of view; nevertheless, some experiments on B/W images are 
presented. 

We shall confine ourselves to the study of a method for measuring the similarity 
between shapes, leaving out the problem of classification. 

2 Size functions 

Let M be a subset of some Euclidean space. We shall call any continuous function 
φ : M → ℝ a measuring function. M is the signal we want to study and φ is a 
function chosen according to the properties of the signal we are interested in. 

The size function of the pair (M, φ) is a function ℓ_(M,φ) : ℝ² → ℕ ∪ {+∞}. 

For every pair (x, y) ∈ ℝ² consider the set of points in M at which φ takes value 
smaller or equal to x. Two such points are considered equivalent if they either 
coincide, or can be connected in M by a path at whose points φ takes value 
smaller or equal to y. Then ℓ_(M,φ)(x, y) is defined to be equal to the number of 
equivalence classes so obtained. 

For an example of a size function see Figure 1. In this example the set M is a 
curve in the plane. By choosing the function "ordinate of the point" as measuring 
function, one obtains the size function shown on the right. The number displayed 
in each region of the domain of the size function denotes the value taken by the 
size function in that region. 



Fig. 1. A plane curve and the associated size function with respect to the 
measuring function "ordinate of the point": φ(x, y) = y. The number displayed in 
each region of the domain of the size function denotes the value taken by the size 
function in that region. Continuous lines represent lines made of discontinuity 
points for the size function. 




We remark that computing a size function is equivalent to counting the number 
of connected components of a graph: this gives the complexity of our task 
(see [6], [7]). 
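
To make the connection with connected components concrete, the following minimal Python sketch evaluates a discrete size function on a graph whose vertices carry values of the measuring function. It is only an illustration under stated assumptions: the function name `size_function`, the vertex and edge representation, and the use of a union-find structure are choices made here and are not the algorithms of [6] or [17].

```python
def size_function(values, edges, x, y):
    """Discrete size function of a vertex-weighted graph at (x, y), x <= y.

    values : values[v] is the measuring function at vertex v
    edges  : list of (u, v) vertex pairs
    Counts the connected components of the sub-graph {value <= y}
    that contain at least one vertex with value <= x.
    """
    parent = list(range(len(values)))

    def find(v):
        # Union-find root lookup with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        # An edge is usable only if both end points lie below level y.
        if values[u] <= y and values[v] <= y:
            parent[find(u)] = find(v)
    return len({find(v) for v in range(len(values)) if values[v] <= x})
```

For instance, on a path graph with vertex values 0, 2, 1, 3, 0.5, the three local minima are separate classes at (1, 1.5), two of them merge once y reaches 2, and all three merge once y reaches 3.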

From the definition of size function one has immediately that if, for y < x, 
there exists an infinite number of points P ∈ M such that y < φ(P) < x, then 
ℓ_(M,φ)(x, y) = +∞. On the other hand, if M is a finite union of compact locally 
arcwise connected components then for x < y we have ℓ_(M,φ)(x, y) < +∞ for 
every measuring function. Since this assumption is not very restrictive, in what 
follows we shall stick to it. Therefore in general we shall be interested only in 
the behaviour of size functions in the set S_0 = {(x, y) ∈ ℝ² : x < y}. 

Another easy consequence of the definition of size functions is their 
monotonicity: size functions are non-decreasing in the first variable and non-increasing 
in the second one. 

For more details and results about size functions we refer the reader to [5], 
[7], [8], [9], [12], [15] and [16]. Algorithms for the computation of size functions 
can be found in [6] and [17]. The resistance of size functions under noise and 
occlusions is treated in [11]. 

3 Size functions and formal series 

Independently of the pair (M, φ), discontinuities of size functions satisfy some 
general properties. The most remarkable, but not immediate, one is that if ℓ_(M,φ) 
is discontinuous at (x, y) with x < y, then either x is a discontinuity point for 
ℓ_(M,φ)(·, y) or y is a discontinuity point for ℓ_(M,φ)(x, ·) or both statements 
hold. 

Moreover, for x < y, discontinuities in the variable x propagate downward 
to the diagonal Δ = {(x, y) ∈ ℝ² | x = y} and discontinuities in the variable y 
propagate toward the right up to the diagonal. 

For every size function it is then possible to select special points in S_0, called 
cornerpoints, from which discontinuity points propagate horizontally and 
vertically toward the diagonal. More precisely, cornerpoints are defined as those 
points p = (x, y), with x < y, satisfying the following property: if we denote by 
μ_{α,β}(p) the number 

\[ \ell_{(M,\varphi)}(x+\alpha, y-\beta) - \ell_{(M,\varphi)}(x+\alpha, y+\beta) - \ell_{(M,\varphi)}(x-\alpha, y-\beta) + \ell_{(M,\varphi)}(x-\alpha, y+\beta), \]

it must hold that μ(p) := min{μ_{α,β}(p) : α > 0, β > 0, x + α < y − β} > 0. We shall 
call μ(p) the multiplicity of p. It can be shown that μ(p) ≥ 0 for every p ∈ S_0. 

Under our assumptions on M a size function has at most a countable set of 
cornerpoints in S_0. Actually, for any fixed ρ > 0, the region S_ρ = {(x, y) ∈ ℝ² : 
x < y − ρ} contains only finitely many cornerpoints. 

In Figure 2 we show that the only cornerpoints for the size function depicted 
on the left are the points a, b, c. All these cornerpoints have multiplicities equal 
to 1 but it is easy to construct examples with points having multiplicity greater 
than 1. 

Fig. 2. A size function on the left and the corresponding cornerpoints (a, b, c) 
and cornerlines (s) on the right. The resulting formal series is α_0(ℓ) = s + a + b + c. 

Analogously one can select special vertical lines whose points above the 
diagonal are discontinuity points for the size function. These lines are called 
cornerlines and are defined as those lines l, with equation x = k, k ∈ ℝ, for which 
it holds 

\[ \mu(l) := \min_{\alpha > 0,\; k+\alpha < y} \; \ell_{(M,\varphi)}(k+\alpha, y) - \ell_{(M,\varphi)}(k-\alpha, y) > 0. \]

We shall call μ(l) the multiplicity of l. From the monotonicity of size functions 
it follows that μ(l) ≥ 0 for every vertical line l. Our assumptions on M imply 
that a size function can have at most a finite number of cornerlines. 

The only cornerline for the size function of Figure 2 is the line s with 
multiplicity equal to 1. In general, it can be shown that every arcwise connected 
component of M generates a cornerline with multiplicity 1. Since different 
components can give rise to the same cornerline, one can obtain cornerlines with 
multiplicity greater than 1. 

The importance of cornerpoints and cornerlines of a size function lies in the 
fact that for almost every point p = (x, y) lying above the diagonal, the value 
of the size function at p is equal to the sum of the multiplicities of cornerpoints 
(x', y') with x' < x and y' > y and of cornerlines x = k with k < x. That is to say 
that cornerpoints and cornerlines determine the size function almost everywhere. 
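
The summation rule above can be written down directly. The sketch below is an assumed illustration: the function name `size_from_corners` and the list-based encoding of cornerpoints and cornerlines with their multiplicities are not part of the original paper.

```python
def size_from_corners(cornerpoints, cornerlines, x, y):
    """Value of a size function at (x, y), x < y, away from discontinuities.

    cornerpoints : list of ((px, py), multiplicity)
    cornerlines  : list of (k, multiplicity) for the vertical lines x = k
    """
    # Cornerpoints to the left of x and above y contribute their multiplicity.
    total = sum(m for (px, py), m in cornerpoints if px < x and py > y)
    # Cornerlines to the left of x contribute their multiplicity.
    total += sum(m for k, m in cornerlines if k < x)
    return total
```

With one cornerline x = 0 and cornerpoints (1, 2) and (0.5, 3), all of multiplicity 1, the function returns 3 at (1.2, 1.5), 2 at (1.2, 2.5) and 1 at (1.2, 3.5).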

Let us now denote by ℛ the set of all lines with equation x = k, with k ∈ ℝ. 
For every ρ ≥ 0, we shall call formal series in S_ρ ∪ ℛ any object of the form 
σ = Σ_X m(X) X with m(X) ∈ ℤ, X varying in S_ρ ∪ ℛ and m(X) ≠ 0 only for 
a countable set of X's. The set of X's such that m(X) ≠ 0 is called the support 
of the formal series. 

It is then natural to define a map α_ρ from the set of all size functions into 
the set of formal series in S_ρ ∪ ℛ. For ρ > 0 it suffices to set α_ρ equal to the 
map which takes a size function ℓ_(M,φ) to the formal series Σ_X μ(X) X, where 
X varies in the set of all cornerpoints belonging to S_ρ and of all cornerlines for 
ℓ_(M,φ), while μ(X) is its multiplicity (cf. Figure 2). For ρ = 0 one can take α_0 
to be the map "extending" all the maps α_ρ for every ρ > 0. For every ρ ≥ 0, α_ρ 
induces a well-defined map ᾱ_ρ from the set C_ρ of all size functions quotiented by 
the relation of coincidence almost everywhere in S_ρ into the set of formal series 
in S_ρ ∪ ℛ. 

Theorem 1. For every real number ρ ≥ 0 the map ᾱ_ρ from C_ρ into the set of 
formal series in S_ρ ∪ ℛ is injective. 

As a consequence, size functions can be represented as 
particular formal series of points and lines of the plane. 

Furthermore, for every ρ ≥ 0, it is possible to characterize a particular subset 
of the set of formal series in S_ρ ∪ ℛ on which ᾱ_ρ turns out to be also surjective. 

For the proofs of the results stated in this section we refer to [10]. 



4 Deformation energy 

It follows from Theorem 1 that we can reduce the problem of assigning an energy 
to the deformation of a size function to the problem of giving an energy to the 
deformation of the associated formal series. This is clearly an easier task since 
formal series allow a concise and algebraic representation of size functions. 

The first step in assigning an energy to the deformation of a size function 
is therefore to define what a deformation of the associated formal series 
is. A very natural way to do this is illustrated by the example in Figure 3. In 
this example we consider two formal series σ_1 = s + a + b + c and σ_2 = 
s' + a' + b', obtained as images of two size functions by the map α_0 of Section 
3. The discontinuity points of the two size functions are represented by dashed 
and continuous lines, respectively. We deform σ_1 into σ_2 by transforming the line 
s into the line s', the points a and b into a' and b', respectively, and by sending 
c onto the diagonal (this last action has the same meaning as "destroying" the 
point c). 

This example can be generalized as follows. 

Let us fix a positive real number $p$. Then every formal series obtained as
image of a size function by the map $\alpha_p$ has finite support in
$S_p \cup \mathcal{R}$. For the sake of simplicity, in what follows the
considered formal series will always have finite support in
$S_p \cup \mathcal{R}$. Nevertheless, our approach can be extended to the
general case when the support is not finite. Moreover, we shall consider each
point having multiplicity $m$ as $m$ distinct points.

Fig. 3. A possible way to deform a size function, represented by dashed lines,
into another one, represented by continuous lines.



Definition 1. We shall call elementary deformation of a formal series $\sigma$
any of the following transformations:

(a) Identity transformation leaving $\sigma$ unchanged.

(b) Transformation taking $\sigma$ to $\sigma - X + X'$ with $X \neq X'$ and
either both $X$ and $X'$ belonging to $S_p$ or both belonging to $\mathcal{R}$.

(c) Transformation taking $\sigma$ to $\sigma + X$ with
$X \in S_p \cup \mathcal{R}$.

(d) Transformation taking $\sigma$ to $\sigma - X$ with
$X \in S_p \cup \mathcal{R}$.

Remark 1. We recall that coefficients of formal series are allowed to be
negative. Applying an elementary deformation of type (b) means transforming
either a point $p$ into a different point $p'$ or a vertical line $l$ into a
different line $l'$. An elementary deformation of type (c) has the effect of
creating either a new point or a new line. Analogously, an elementary
deformation of type (d) has the effect of eliminating either a point $p$ or a
line $l$.

Definition 2. We shall call deformation of a formal series any finite ordered
sequence of elementary deformations of the considered formal series. We shall
denote by $T(\sigma)$ the result of the deformation $T$ applied to the formal
series $\sigma$. Given two deformations $T$ and $S$, the juxtaposition $ST$
means applying first $T$ and then $S$.

Remark 2. The effect of each deformation of formal series is independent of the 
order in which elementary deformations are carried out. 






Pietro Donatini et al. 



Remark 3. The inverse of an elementary deformation of type (b) is again an
elementary deformation of the same type. The inverse of an elementary
deformation of type (c) is an elementary deformation of type (d) and vice versa.

In general, the inverse of a deformation $T$ will be denoted by the symbol
$T^{-1}$.

Remark 4. Given two formal series (with finite support) there exists an
infinite number of deformations transforming such formal series into each
other. In fact, if $T$ and $S$ are deformations of a formal series $\sigma$,
then $T(\sigma) = TSS^{-1}(\sigma)$ but $T \neq TSS^{-1}$.

The next step is to define the energy of a deformation.

Definition 3. For every deformation $T$ of a formal series, the associated
energy $\mathcal{E}(T)$ is a real number greater than or equal to 0 satisfying
the following properties:

i) if $T$ is the identity transformation, then $\mathcal{E}(T) = 0$;
ii) if $T = T_1 T_2 \cdots T_r$, then
$\mathcal{E}(T) = \sum_{i=1}^{r} \mathcal{E}(T_i)$;
iii) $\mathcal{E}(T) = \mathcal{E}(T^{-1})$.

We allow the possibility that $\mathcal{E}(T)$ is equal to $+\infty$.

As an example of deformation energy, one can consider the following one: if
$T$ is an elementary deformation of type (b) sending a point $p$ to a point
$p'$ (resp. a line $l$ to a line $l'$), then $\mathcal{E}(T)$ is equal to the
Euclidean distance between $p$ and $p'$ (resp. $l$ and $l'$); if $T$ is an
elementary deformation of type (c) or (d) creating or destroying a point $p$,
then $\mathcal{E}(T)$ is equal to the Euclidean distance between $p$ and the
diagonal $\Delta$; if $T$ is an elementary deformation of type (c) or (d)
creating or destroying a line $l$, then $\mathcal{E}(T)$ is equal to $+\infty$.

The deformation energy thus defined will be used for the experiments de- 
scribed in the following section. 

The last step consists of deducing a measure of the similarity between two 
formal series from the measure of the energy necessary to deform them into each 
other. 

Definition 4. For every pair of formal series $\sigma_1$ and $\sigma_2$ the
similarity between $\sigma_1$ and $\sigma_2$ is the number
$\Sigma(\sigma_1, \sigma_2) = \inf\{\mathcal{E}(T)\}$, where $T$ runs in the set
of all the deformations taking $\sigma_1$ to $\sigma_2$.

Remark 5. We point out that if $\mathcal{E}$ is the deformation energy defined
in the above example, then $\inf\{\mathcal{E}(T)\}$ is actually a minimum. In
general, however, $\inf\{\mathcal{E}(T)\}$ need not be attained.
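
To make Definition 4 concrete, the following is a minimal sketch for the example energy above, under simplifying assumptions that are ours and not the paper's: both formal series contain only points with unit multiplicity (no lines), and the minimum is searched among matchings in which each point is either moved onto a point of the other series or sent to the diagonal. The function names `dist_to_diagonal` and `similarity` are hypothetical, and the brute-force search is only practical for a handful of points.

```python
import itertools
import math


def dist_to_diagonal(p):
    """Euclidean distance from the point p = (x, y) to the diagonal y = x."""
    x, y = p
    return abs(y - x) / math.sqrt(2.0)


def similarity(points1, points2):
    """Brute-force minimum deformation energy between two small point sets.

    Assumption: a point is either moved to a point of the other set
    (cost: Euclidean distance) or created/destroyed (cost: distance
    to the diagonal). Lines are not handled in this sketch.
    """
    # Pad each set with "diagonal" slots (None) so that every point
    # may also be matched with the diagonal.
    slots1 = list(points1) + [None] * len(points2)
    slots2 = list(points2) + [None] * len(points1)
    best = math.inf
    for perm in itertools.permutations(range(len(slots2))):
        cost = 0.0
        for i, j in enumerate(perm):
            a, b = slots1[i], slots2[j]
            if a is None and b is None:
                continue  # diagonal matched with diagonal costs nothing
            if a is None:
                cost += dist_to_diagonal(b)  # point b is created
            elif b is None:
                cost += dist_to_diagonal(a)  # point a is destroyed
            else:
                cost += math.dist(a, b)  # point a is moved onto b
        best = min(best, cost)
    return best
```

For larger point sets the same matching could be solved with an assignment algorithm instead of enumerating permutations.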

Let us finally observe that $\Sigma$ is actually well suited for a comparison
between formal series, and hence between size functions, because it satisfies
the following properties, making it a pseudo-distance.

Proposition 1. The following properties hold:

1) for every pair of formal series $\sigma_1$ and $\sigma_2$,
$\Sigma(\sigma_1, \sigma_2) \geq 0$;






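2) for every formal series $\sigma$, $\Sigma(\sigma, \sigma) = 0$;

3) for every pair of formal series $\sigma_1$ and $\sigma_2$,
$\Sigma(\sigma_1, \sigma_2) = \Sigma(\sigma_2, \sigma_1)$;

4) for every triad of formal series $\sigma_1$, $\sigma_2$ and $\sigma_3$,
$\Sigma(\sigma_1, \sigma_3) \leq \Sigma(\sigma_1, \sigma_2) + \Sigma(\sigma_2, \sigma_3)$.

Proof. Property 1) is trivial. Since $\Sigma(\sigma, \sigma) \geq 0$ for every
formal series $\sigma$, property 2) is easily proven by taking $T$ equal to the
identity transformation and by applying property i) of Definition 3.

Property 3) is a direct consequence of property iii) of Definition 3.

Finally, property 4) is proven by observing that for every sequence of positive
real numbers $(\epsilon_i)$ with $\lim_{i \to +\infty} \epsilon_i = 0$, there
exists a sequence of deformations $(T_i^1)_{i \in \mathbb{N}}$ taking $\sigma_1$
to $\sigma_2$ and verifying the inequality
$\Sigma(\sigma_1, \sigma_2) \geq \mathcal{E}(T_i^1) - \epsilon_i$. Analogously,
there exists a sequence of deformations $(T_i^2)_{i \in \mathbb{N}}$ taking
$\sigma_2$ to $\sigma_3$ and verifying the inequality
$\Sigma(\sigma_2, \sigma_3) \geq \mathcal{E}(T_i^2) - \epsilon_i$. Thus for
every $i \in \mathbb{N}$ it holds that
$\Sigma(\sigma_1, \sigma_3) \leq \mathcal{E}(T_i^2 T_i^1) = \mathcal{E}(T_i^1) + \mathcal{E}(T_i^2) \leq \Sigma(\sigma_1, \sigma_2) + \Sigma(\sigma_2, \sigma_3) + 2\epsilon_i$.
The desired inequality is obtained for $i \to +\infty$.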

5 Experimental results 

We have tested the method described here on the task of recognizing B/W
images. More precisely, we have considered 15 drawings belonging to three
classes of objects: scissors, open end wrenches, and screwdrivers. For each
drawing we have computed the size function of its outer edge with respect to
the measuring function “distance from the center of mass”. The size functions
have then been transformed into formal series. These two operations have been
carried out by means of the program SketchUp (for more details about this
classifier of free-hand drawn sketches we refer the reader to [1]). Finally,
the similarity between images has been calculated by using the deformation
energy suggested in Section 4.

The results are displayed in Table 1, showing that the deformation energy 
can be used to easily distinguish the considered shapes. Indeed, the average 
similarity between size functions corresponding to drawings of the same class, 
displayed in the gray rectangles, is smaller than that between drawings belonging 
to different classes. 

6 Conclusions 

In this paper we have shown how an energy minimization paradigm can be suc- 
cessfully used to compare size functions. This approach transforms the problem 
of shape comparison into a minimization problem with respect to deformations 
of finite sets of points and lines of the plane, which is a much simpler task. A
concrete application of this technique has been given. 
Abstract. Consider the problem of fitting a finite Gaussian mixture,
with an unknown number of components, to observed data. This pa- 
per proposes a new minimum description length (MDL) type criterion, 
termed MMDL (for mixture MDL), to select the number of components of 
the model. MMDL is based on the identification of an “equivalent sample 
size”, for each component, which does not coincide with the full sample 
size. We also introduce an algorithm based on the standard expectation- 
maximization (EM) approach together with a new agglomerative step, 
called agglomerative EM (AEM). The experiments here reported have 
shown that MMDL outperforms existing criteria of comparable compu- 
tational cost. The good behavior of AEM, namely its good robustness 
with respect to initialization, is also illustrated experimentally. 



1 Introduction 

Finite mixtures are a flexible and powerful probabilistic modeling tool. In sta- 
tistical pattern recognition, mixtures allow a formal (model-based) approach to 
(unsupervised) clustering [7], [15]; in fact, mixtures adequately describe situa- 
tions where each observation is modeled as having been produced by one of a 
set of alternative mechanisms [31]. However, strict adherence to this interpreta- 
tion is not required. Mixtures can simply be seen as models able to represent 
arbitrarily complex probability density functions (pdf’s); this makes them an 
ideal tool for representing complex class-conditional pdf’s in supervised learning 
scenarios (see [6], [22] and references therein). 

This paper is devoted to the problem of fitting Gaussian mixtures with un- 
known number of components to multivariate observations. The two fundamental 
issues to be dealt with are: (a) how to estimate the number of components, for 
which several techniques (reviewed below) have been proposed; and (b) how to 
estimate the parameters defining the mixture model. For this second question,
the standard answer is the expectation-maximization (EM) algorithm, but several
authors have also advocated the (much more computationally demanding)
Markov chain Monte Carlo (MCMC) method.

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 54-69, 1999.

© Springer-Verlag Berlin Heidelberg 1999

We propose a new criterion to estimate the number of components which is
shown experimentally to outperform existing methods of comparable
computational cost. Our criterion is a modified version of the minimum
description length (MDL) principle, based on what can be called the
“equivalent sample size”. We also introduce an (EM-based) algorithm aimed at
mitigating the initialization dependence that makes EM difficult to use in
practice. From a clustering perspective, our algorithm can be seen as an
agglomerative hierarchical-type scheme, thus we term it agglomerative EM
(AEM): we start with a large number of components (clusters) and evolve
towards a small number of components. From a density estimation perspective,
our algorithm has a multi-scale flavor: we go from a fine-scale representation
with a large number of components, thus potentially irregular, to a
smoother/coarser one with fewer components.

We review relevant previous work on mixture model fitting in Section 2, 
which also serves to introduce notation and the EM algorithm. Section 3 presents 
the MMDL criterion, while Section 4 is devoted to AEM. Section 5 reports 
experimental results, and Section 6 presents our conclusions. 

2 Fitting Mixture Models 

2.1 Introduction 

Let $Y = [Y_1, \ldots, Y_d]^T$ be a $d$-dimensional random variable, with
$y = [y_1, \ldots, y_d]^T$ representing one particular outcome of $Y$. It is
said that $Y$ has a finite mixture distribution if its probability density
function can be written as

$$f_Y(y \mid \Theta_{(k)}) = \sum_{m=1}^{k} \alpha_m f_Y(y \mid \theta_m), \tag{1}$$

where $k$ is the number of components, each $f_Y(y \mid \theta_m)$ is called a
component density function, and the $\alpha_m$ ($\sum_{m=1}^{k} \alpha_m = 1$)
are the mixing probabilities. Assuming that all the components have the same
functional form (e.g., they are all $d$-variate Gaussian), each one is fully
characterized by the parameter vector $\theta_m$. Let
$\Theta_{(k)} = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_{k-1}\}$
be the parameter set defining a given mixture (notice that
$\alpha_k = 1 - \sum_{m=1}^{k-1} \alpha_m$), and $\mathcal{M}_{(k)}$ be the
space of all possible $k$-component mixtures built from a certain class of
pdf’s. This paper focuses on mixtures of Gaussian components, denoted as
$f_Y(y \mid \theta_m) = \mathcal{N}(y \mid \mu_m, C_m)$, where
$\theta_m = (\mu_m, C_m)$ if arbitrary covariance $C_m$ and mean $\mu_m$ are
assumed; if a common covariance $C$ is adopted, we simply write
$\theta_m = \mu_m$.

The maximum likelihood (ML) estimate of $\Theta_{(k)}$, based on a set of $n$
independent observations $y_{\rm obs} = \{y^{(1)}, \ldots, y^{(n)}\}$, is

$$\hat{\Theta}_{(k)} = \arg\max_{\Theta_{(k)}} L(\Theta_{(k)}, y_{\rm obs}), \tag{2}$$






Mario A.T. Figueiredo et al. 



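where $L(\Theta_{(k)}, y_{\rm obs})$ is the log-likelihood function

$$L(\Theta_{(k)}, y_{\rm obs}) = \log \prod_{i=1}^{n} f_Y(y^{(i)} \mid \Theta_{(k)}) = \sum_{i=1}^{n} \log \sum_{m=1}^{k} \alpha_m f_Y(y^{(i)} \mid \theta_m). \tag{3}$$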

In general, Eq. (2) has no closed-form solution but it can be approached quite 
easily via the expectation-maximization (EM) algorithm [16], [31]. 
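
As a concrete reading of Eqs. (1) and (3), the sketch below evaluates the mixture density and the log-likelihood for the univariate Gaussian case ($d = 1$). It is an illustration under our own assumptions, not code from the paper: the function names are hypothetical, components are described by plain lists of mixing probabilities, means and variances, and no safeguard against numerical underflow is included.

```python
import math


def gaussian_pdf(y, mean, var):
    """Univariate Gaussian density N(y | mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(y, alphas, means, variances):
    """Mixture density of Eq. (1): weighted sum of component densities."""
    return sum(a * gaussian_pdf(y, m, v)
               for a, m, v in zip(alphas, means, variances))


def log_likelihood(data, alphas, means, variances):
    """Log-likelihood of Eq. (3): sum over observations of log mixture density."""
    return sum(math.log(mixture_pdf(y, alphas, means, variances)) for y in data)
```

Splitting one component into two identical ones whose mixing probabilities sum to the original leaves the log-likelihood unchanged, which is the nesting property discussed in Section 2.3.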



2.2 The EM Algorithm for Gaussian Mixtures 



Behind the EM algorithm is the interpretation of the set of observations
$y_{\rm obs} = \{y^{(1)}, \ldots, y^{(n)}\}$ as incomplete data, with the
missing information being a corresponding set of labels
$z_{\rm miss} = \{z^{(1)}, \ldots, z^{(n)}\}$ [16], [31]. Each of these labels
has the form $z^{(i)} = [z_1^{(i)}, \ldots, z_k^{(i)}]^T$, with
$z_m^{(i)} = 1$ and $z_p^{(i)} = 0$, for $p \neq m$, if and only if $y^{(i)}$
was produced by the $m$-th component of the mixture. This complete data setup
agrees with the interpretation of a mixture density as a model of a two-step
data generation process: first, randomly choose one of the $k$ available
“data generators” with probabilities $\{\alpha_1, \ldots, \alpha_k\}$; then,
produce a sample from the chosen “generator”.

The log-likelihood function based on the complete data
$\{y_{\rm obs}, z_{\rm miss}\}$, denoted
$L_c(\Theta_{(k)}, y_{\rm obs}, z_{\rm miss})$, is easily found to be (for
details see [31])

$$L_c(\Theta_{(k)}, y_{\rm obs}, z_{\rm miss}) = \sum_{i=1}^{n} \sum_{m=1}^{k} z_m^{(i)} \log\left[\alpha_m f_Y(y^{(i)} \mid \theta_m)\right]. \tag{4}$$



In its general form, the EM algorithm proceeds by successively applying two
steps to produce a sequence of parameter estimates
$\{\ldots, \hat{\Theta}_{(k)}^{(t)}, \hat{\Theta}_{(k)}^{(t+1)}, \ldots\}$:

E-step: Compute the expected value of the complete log-likelihood, conditioned
on the observed data and on the current parameter estimate
$\hat{\Theta}_{(k)}^{(t)}$,

$$Q(\Theta_{(k)}, \hat{\Theta}_{(k)}^{(t)}) = \int L_c(\Theta_{(k)}, y_{\rm obs}, z_{\rm miss}) \, f_Z(z_{\rm miss} \mid \hat{\Theta}_{(k)}^{(t)}, y_{\rm obs}) \, dz_{\rm miss}.$$

M-step: Update the parameter estimates according to

$$\hat{\Theta}_{(k)}^{(t+1)} = \arg\max_{\Theta_{(k)}} Q(\Theta_{(k)}, \hat{\Theta}_{(k)}^{(t)}). \tag{5}$$



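Under mild conditions [16], EM converges to a (local) maximum of
$L(\Theta_{(k)}, y_{\rm obs})$. The key to the efficient implementation of this
algorithm is the choice of an observed/missing data structure, i.e., the
function $L_c(\Theta_{(k)}, y_{\rm obs}, z_{\rm miss})$, such that the E and M
steps have simple closed-form expressions. This is the case in Eq. (4), which
is linear in the missing variables, thus reducing the E-step to the
computation of the conditional expectation of the variables $z_m^{(i)}$ [16],
[31],

$$\hat{z}_m^{(i,t)} \equiv E\left[z_m^{(i)} \mid \hat{\Theta}_{(k)}^{(t)}, y_{\rm obs}\right] = \frac{\hat{\alpha}_m^{(t)} f_Y(y^{(i)} \mid \hat{\theta}_m^{(t)})}{\sum_{j=1}^{k} \hat{\alpha}_j^{(t)} f_Y(y^{(i)} \mid \hat{\theta}_j^{(t)})}. \tag{6}$$

The M-step also has a simple closed-form solution (recall that
$\theta_m = \{\mu_m, C_m\}$):

$$\hat{\alpha}_m^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{z}_m^{(i,t)}, \tag{7}$$

$$\hat{\mu}_m^{(t+1)} = \left(\sum_{i=1}^{n} \hat{z}_m^{(i,t)}\right)^{-1} \sum_{i=1}^{n} \hat{z}_m^{(i,t)} \, y^{(i)}, \tag{8}$$

$$\hat{C}_m^{(t+1)} = \left(\sum_{i=1}^{n} \hat{z}_m^{(i,t)}\right)^{-1} \sum_{i=1}^{n} \hat{z}_m^{(i,t)} \left(y^{(i)} - \hat{\mu}_m^{(t+1)}\right) \left(y^{(i)} - \hat{\mu}_m^{(t+1)}\right)^T. \tag{9}$$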

The main difficulties in using EM for mixture model fitting are: its critical
dependence on initialization; and the possibility of convergence to a point on
the boundary of the parameter space with unbounded likelihood (i.e., one of the
$\alpha_m$ parameters approaching zero with the corresponding covariance
becoming arbitrarily close to singular).
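
The E-step and M-step of Section 2.2 can be sketched for the univariate case as follows. This is a minimal illustration under our own assumptions: it uses the standard closed-form EM updates for Gaussian mixtures (posterior component probabilities, then weighted proportions, means and variances), the function names are hypothetical, and nothing protects against the degenerate cases listed above (vanishing components or variances collapsing to zero).

```python
import math


def normal_pdf(y, mean, var):
    """Univariate Gaussian density N(y | mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(data, alphas, means, variances):
    """One EM iteration for a univariate Gaussian mixture."""
    k, n = len(alphas), len(data)

    # E-step: posterior probability that observation i came from component m.
    resp = []
    for y in data:
        weighted = [alphas[m] * normal_pdf(y, means[m], variances[m])
                    for m in range(k)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])

    # M-step: weighted sample proportions, means and variances.
    new_alphas, new_means, new_vars = [], [], []
    for m in range(k):
        weight = sum(r[m] for r in resp)
        mean = sum(r[m] * y for r, y in zip(resp, data)) / weight
        var = sum(r[m] * (y - mean) ** 2 for r, y in zip(resp, data)) / weight
        new_alphas.append(weight / n)
        new_means.append(mean)
        new_vars.append(var)
    return new_alphas, new_means, new_vars
```

Repeated calls to `em_step` never decrease the log-likelihood of Eq. (3), which is why only a local maximum is guaranteed.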



2.3 Estimating the Number of Components 

It is well known that the ML criterion cannot be used to estimate the number
of mixture components because
$\mathcal{M}_{(k)} \subseteq \mathcal{M}_{(k+1)}$; for example,
$\Theta_{(k)} = \{\theta_1, \ldots, \theta_k, \alpha_1, \ldots, \alpha_{k-1}\}$
and
$\Theta'_{(k+1)} = \{\theta'_1, \ldots, \theta'_{k+1}, \alpha'_1, \ldots, \alpha'_k\}$
such that $\theta_k = \theta'_k = \theta'_{k+1}$ and
$\alpha_k = \alpha'_k + \alpha'_{k+1}$ (where, of course,
$\alpha_k = 1 - \sum_{j=1}^{k-1} \alpha_j$ and
$\alpha'_{k+1} = 1 - \sum_{j=1}^{k} \alpha'_j$) represent intrinsically
indistinguishable mixture densities. Consequently, the maximized likelihood
$L(\hat{\Theta}_{(k)}, y_{\rm obs})$ is a non-decreasing function of $k$, thus
useless as a model selection criterion. This is a particular instance of the
identifiability problem (see, e.g., [31]). As also pointed out in [31],
classical ($\chi^2$ based) hypothesis testing is not useful here because the
necessary regularity conditions are not met.

Several approaches are available to estimate the number of components of 
a mixture; from an algorithmic standpoint, they can be divided into two main 
classes: EM-based techniques and stochastic techniques. 



EM-based approaches use the (fixed $k$) EM algorithm to obtain a sequence of
parameter estimates for a range of values of $k$,
$\{\hat{\Theta}_{(k)}, \; k = k_{\min}, \ldots, k_{\max}\}$, with the estimate
of $k$ being defined as the minimizer of some cost function,

$$\hat{k} = \arg\min_k \left\{ \mathcal{C}(\hat{\Theta}_{(k)}, k), \; k = k_{\min}, \ldots, k_{\max} \right\}. \tag{10}$$

Most often, this cost function includes the maximized log-likelihood function
plus an additional term whose role is to penalize large values of $k$.

Under this general formulation, we find the MDL criterion [23] in which the
cost function is

$$\mathcal{C}_{\rm MDL}(\hat{\Theta}_{(k)}, k) = -L(\hat{\Theta}_{(k)}, y_{\rm obs}) + \frac{N(k)}{2} \log n, \tag{11}$$

where $N(k)$ is the number of parameters needed to specify a $k$-component
mixture. For arbitrary means and covariances,
$N(k) = (k - 1) + k(d + d(d + 1)/2)$ (recall that $d$ is the dimension of $Y$);
if a common covariance is assumed, then
$N(k) = (k - 1) + kd + d(d + 1)/2$.
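
The parameter count and the MDL/BIC cost can be written down directly. The sketch below assumes the usual form of the criterion, namely the negative maximized log-likelihood plus $(N(k)/2)\log n$, which is consistent with the $(1/2)\log n$ code length per real parameter discussed in Section 3; the function names `num_params` and `mdl_cost` are hypothetical.

```python
import math


def num_params(k, d, common_covariance=False):
    """Number of free parameters N(k) of a k-component d-variate Gaussian mixture."""
    cov_params = d * (d + 1) // 2  # entries of a symmetric d x d matrix
    if common_covariance:
        return (k - 1) + k * d + cov_params
    return (k - 1) + k * (d + cov_params)


def mdl_cost(max_log_lik, k, d, n, common_covariance=False):
    """MDL/BIC cost: negative maximized log-likelihood plus (N(k)/2) log n."""
    return -max_log_lik + 0.5 * num_params(k, d, common_covariance) * math.log(n)
```

For example, a 3-component univariate mixture has 2 free mixing probabilities, 3 means and 3 variances, so `num_params(3, 1)` counts 8 parameters.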

Several EM-based approaches also use approximate versions of the Bayes
factor (the correct Bayesian model selection criterion [9]), such as the
evidence-based Bayesian (EBB) criterion [25], the approximate weight of
evidence (AWE) [1], and Schwarz’s Bayesian inference criterion (BIC) [5].
Although derived in a different framework, BIC formally coincides with MDL and
is also given by Eq. (11). The minimum message length (MML) criterion [20],
Akaike’s information criterion (AIC) [35], and Bezdek’s partition coefficient
(PC) [3] are other approaches in this class. As pointed out in [25], EBB,
MDL/BIC, and MML perform comparably and outperform all other methods against
which they were tested. Concerning AWE, it is argued in [5] that MDL/BIC
provides a better approximation to the true Bayes factor. The AIC and PC
criteria were shown in [20] (based on tests on 20 different mixtures) to be
outperformed by MML and MDL/BIC. Accordingly, any new method in this class
need only be compared against EBB, MDL/BIC, or MML. Finally, drawbacks of MML
and EBB are: MML cannot be used for certain values of $d$ (for example $d = 9$
and $d > 24$) [25]; both EBB and MML depend on arbitrarily chosen parameters
which can critically influence their results.

Resampling-based schemes [14] (which have also been used in a clustering 
framework [8]) and cross-validation approaches [30] are (computationally) much 
closer to stochastic techniques (see below) than to the methods in the previous 
paragraph and will not be further considered here. 



Stochastic approaches involve Markov chain Monte Carlo (MCMC) sampling
and are far more computationally intensive than EM. MCMC is used in two
different ways: to implement model selection criteria to actually estimate $k$
(e.g., [2], [18], [26]); and, in a more “fully Bayesian” way, to sample from
the full a posteriori distribution with $k$ considered unknown [19], [21].
Despite their formal appeal, we think that MCMC-based techniques are still far
too computationally demanding to be useful in pattern recognition
applications. For example, tests reported in [21], using small samples
($n = 245, 155, 82$) of univariate data, require 100000 MCMC sweeps following
a so-called burn-in period of another 100000 sweeps; this is a huge amount of
computation for such small problems.




2.4 Initialization of EM

The EM algorithm requires an initial parameter setting
$\hat{\Theta}_{(k)}^{(1)}$ or an initial association of each observation to one
of the components (i.e., an initial setting of the $\hat{z}_m^{(i)}$) [16],
[31]. This is a critical issue because EM converges to a local maximum of the
likelihood function: the final estimate depends on the initialization. There
are several different approaches to deal with this difficulty. Running EM
several times, from random initializations, and then choosing the final
estimate that leads to the highest local maximum of the likelihood is a
commonly used technique (e.g., [17] and [25]). Another common procedure is to
use some clustering method to provide an initial partition of the data [17].
Finally, we mention the deterministic annealing (DA) EM algorithm (DAEM); DA is
a fast surrogate of the (stochastic) simulated annealing approach to global
optimization, which has been successfully applied in several problems [27]. In
particular, for mixture estimation, DAEM avoids some of the initialization
dependence of EM [10], [32], [36]. All these choices pay a high price in terms
of computational efficiency.

3 The MMDL Criterion 

It was shown in [25] that MDL/BIC (although simpler) performs comparably
with EBB and MML, although it sometimes slightly underestimates the true
$k$. A similar conclusion can be obtained from the many (20) tests described
in [20]. It was also reported in [11] and [29] that MDL/BIC tends to slightly
underestimate the true order. In order to overcome this problem, let us look
again at the MDL criterion in Eq. (11). The meaning of the MDL cost function
is the total code length of a two-part code for the observed data
$y_{\rm obs}$ and the parameter estimate $\hat{\Theta}_{(k)}$ (see [23], for
details and motivation): first encode the data, given $\hat{\Theta}_{(k)}$;
then, encode $\hat{\Theta}_{(k)}$. Formally, Eq. (11) is of the form

$$\mathcal{C}_{\rm MDL}(y_{\rm obs}, \hat{\Theta}_{(k)}) = \mathcal{L}(y_{\rm obs}, \hat{\Theta}_{(k)}) = \mathcal{L}(y_{\rm obs} \mid \hat{\Theta}_{(k)}) + \mathcal{L}(\hat{\Theta}_{(k)}), \tag{12}$$

where
$\mathcal{L}(y_{\rm obs} \mid \hat{\Theta}_{(k)}) = -L(\hat{\Theta}_{(k)}, y_{\rm obs})$
is the well-known Shannon’s optimal code length¹. The second code-length,
$\mathcal{L}(\hat{\Theta}_{(k)})$, results from the following reasoning. To
obtain finite-length codewords for $\hat{\Theta}_{(k)}$, its (real-valued)
elements are truncated to some finite precision. With a coarse precision,
$\mathcal{L}(\hat{\Theta}_{(k)})$ is small but the encoded parameters may be far
from the optimal ones and so the first part of the code may become longer.
With a finer resolution, the encoded parameters will be close to the optimal
ones, but longer codewords are required. As shown in [23], the optimal
code-length for each real parameter, asymptotically for large $n$, is
$(1/2) \log n$; this leads to Eq. (11).

¹ As is usually done, we are ignoring the integer constraint on code-lengths
and disregarding that we are dealing with densities, not probability masses.
Discretization would lead to probability masses and a common (thus irrelevant)
additional code length term [23].

In most problems where the MDL/BIC criterion is used, all data points have
equal importance in estimating each component of the parameter vector. This is
not the case in mixtures, where each data point has its own weight in
estimating different parameters, as is clear from Eqs. (8) and (9). This fact
is revealed if we compute the Fisher information of a parameter of the $m$-th
mode of the mixture (denoted generically as $\theta_m$), which leads to (see
[31])

$$I(\theta_m) = n \, \alpha_m \, I_1(\theta_m), \tag{13}$$

where $I_1(\theta_m)$ denotes the Fisher information associated with a single
observation known to have been produced by the $m$-th component density, i.e.,

$$I_1(\theta_m) = -E\left[\frac{\partial^2}{\partial \theta_m^2} \log f_Y(y \mid \theta_m)\right].$$

What Eq. (13) shows is that a parameter $\theta_m$ “sees” an equivalent sample
size equal to $n \alpha_m$ rather than $n$. This is intuitively acceptable
because $\theta_m$ will basically be estimated from the data that “was
generated” by the $m$-th component of the mixture; the expected amount of this
data is precisely $n \alpha_m$. Applying this fact, while keeping the classical
MDL code-length for the mixing probabilities (because these are estimated from
all the data), we finally obtain the MMDL cost function


$$\mathcal{C}_{\rm MMDL}(\hat{\Theta}_{(k)}, k) = -L(\hat{\Theta}_{(k)}, y_{\rm obs}) + \frac{k - 1}{2} \log n + \frac{N(1)}{2} \sum_{m=1}^{k} \log(n \hat{\alpha}_m)$$

$$= -L(\hat{\Theta}_{(k)}, y_{\rm obs}) + \frac{N(k)}{2} \log n + \underbrace{\frac{N(1)}{2} \sum_{m=1}^{k} \log \hat{\alpha}_m}_{< 0}, \tag{14}$$

where $N(1)$ is the number of real parameters defining each component (see the
paragraph after Eq. (11)). The MMDL cost function can also be interpreted
from a BIC-type perspective as the inclusion of some of the $O(1)$ terms that
are dropped to obtain the classical form.

In summary, the MMDL criterion introduces a lower penalty than MDL/BIC; 
notice that the new term that appears in Eq. (14) when compared with Eq. (11) 
is necessarily negative. This is a result of the identification of the amount of data 
which is effectively used in estimating the parameters of each component of the 
mixture. 
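
A direct transcription of the MMDL cost for Gaussian components with arbitrary covariances might look as follows. It is a sketch under our reading of Eq. (14): each of the $N(1) = d + d(d+1)/2$ component parameters is charged $(1/2)\log(n\hat{\alpha}_m)$ and each of the $k - 1$ free mixing probabilities $(1/2)\log n$. The function name and argument layout are hypothetical.

```python
import math


def mmdl_cost(max_log_lik, alphas, d, n):
    """MMDL cost for a Gaussian mixture with arbitrary covariances.

    max_log_lik: maximized log-likelihood of the fitted mixture
    alphas:      estimated mixing probabilities (all assumed > 0)
    d:           dimension of the observations
    n:           number of observations
    """
    k = len(alphas)
    params_per_component = d + d * (d + 1) // 2  # N(1): mean plus covariance
    mixing_term = 0.5 * (k - 1) * math.log(n)
    component_term = 0.5 * params_per_component * sum(
        math.log(n * a) for a in alphas)
    return -max_log_lik + mixing_term + component_term
```

For $k = 1$ the single mixing probability equals one, so the cost coincides with the MDL/BIC cost; for $k > 1$ the logarithms of the mixing probabilities are negative and the penalty is lower.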



4 The Algorithm 

To implement the MMDL criterion we propose a new (EM-based) algorithm. Let
$k_{\max}$ be some number known to be considerably larger than the
true/optimal $k$ (say, $k_{\rm true}$) and $k_{\min}$ be another number such
that, for sure, $k_{\min} \leq k_{\rm true}$. The basic structure of the
algorithm is as follows:

Initialization:

Set $k \leftarrow k_{\max}$.

Let $\hat{\Theta}_{(k)}^{(1)}$ be some initial $k$-component mixture estimate.

Main Loop:

While $k \geq k_{\min}$, repeat:

- Run EM, using $\hat{\Theta}_{(k)}^{(1)}$ as initialization, until a stopping
condition is met. Store the resulting mixture parameter estimate
$\hat{\Theta}_{(k)}$.

- Compute and store $\mathcal{C}_{\rm MMDL}(\hat{\Theta}_{(k)}, k)$.

- Obtain a $(k - 1)$-component mixture, “close” (in a sense to be specified
below) to the $k$-component one specified by $\hat{\Theta}_{(k)}$. Let
$\hat{\Theta}_{(k-1)}^{(1)}$ represent this $(k - 1)$-component mixture.

- Set $k \leftarrow k - 1$.

Choosing the optimal $k$:

Find the minimum of the stored MMDL cost function values:

$$\hat{k}_{\rm MMDL} = \arg\min_k \left\{ \mathcal{C}_{\rm MMDL}(\hat{\Theta}_{(k)}, k), \; k = k_{\min}, k_{\min} + 1, \ldots, k_{\max} \right\}.$$

The final mixture parameter estimate is $\hat{\Theta}_{(\hat{k}_{\rm MMDL})}$.

The crucial aspect of the algorithm is the use of a $(k - 1)$-component
mixture, “close” to the current $k$-component one, to initialize the next run
of EM. This is done by looking for the pair of components that are closest to
each other and least probable, and merging them into a single new component
(see details below). For
this reason, our algorithm shares some of the spirit of agglomerative hierarchical 
clustering schemes [7], thus we call it agglomerative EM (AEM). The first run 
of EM, due to the excessive number of components, is somewhat insensitive to 
initialization. Of course we are not claiming that AEM is guaranteed to find 
the globally optimal mixture estimate; it is known that even MCMC may have 
difficulties escaping from local maxima of the likelihood function [24] . 

AEM can be used with any criterion other than MMDL, or even when $k_{\rm true}$
is known: in this case, simply set $k_{\min} = k_{\rm true}$ and skip the phase
where the optimal $k$ is chosen. Naturally, AEM can also be based on modified
versions of EM [16].
Finally, observe that the computational requirements of AEM are the minimum 
possible for any EM-based method doing unknown order mixture fitting. EM 
only has to be applied once for each value of k, instead of the common approach 
of using a set of random initializations for each k. 
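
The control flow of AEM can be sketched independently of how EM, the cost function and the merging step are implemented. In the sketch below the three callables `run_em`, `cost` and `merge_closest` are hypothetical placeholders supplied by the caller, and a mixture can be any object they understand; the only deviation from the description above is that the final, unused merge at $k = k_{\min}$ is skipped.

```python
def agglomerative_em(initial_mixture, k_max, k_min, run_em, cost, merge_closest):
    """Skeleton of the AEM loop: fit, score, merge two components, repeat.

    run_em(mixture)        -> mixture fitted by EM from that initialization
    cost(mixture, k)       -> model selection cost (e.g. MMDL) of the fit
    merge_closest(mixture) -> mixture with one component fewer
    """
    stored = {}
    mixture = initial_mixture
    k = k_max
    while k >= k_min:
        fitted = run_em(mixture)            # EM runs once per value of k
        stored[k] = (cost(fitted, k), fitted)
        if k > k_min:
            mixture = merge_closest(fitted)  # initialization for k - 1
        k -= 1
    best_k = min(stored, key=lambda kk: stored[kk][0])
    return best_k, stored[best_k][1]
```

Because each EM run starts from the merged result of the previous one, the total number of EM runs is $k_{\max} - k_{\min} + 1$.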

4.1 Initialization. 

For low dimensions ($d = 1, 2$), the initial mixture is composed of $k_{\max}$
components uniformly spread over the region occupied by the observed data
(defined by the minimum and maximum observed values of each coordinate). For
higher dimensions, a better initialization is obtained by clustering the data
into $k_{\max}$ groups using successive binary splitting and $K$-means
optimization at each stage [7]. As long as $k_{\max}$ is large enough, AEM is
quite insensitive to initialization.



4.2 Stopping Conditions for EM. 

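Each run of EM is stopped if at least one of the following two conditions is
true:

Condition 1:

$$\max\left\{ \frac{\|\hat{\mu}_m^{(t+1)} - \hat{\mu}_m^{(t)}\|}{\|\hat{\mu}_m^{(t)}\|}, \; m = 1, 2, \ldots, k \right\} < \delta_\mu \quad \text{and} \quad \max\left\{ \frac{\|\hat{C}_m^{(t+1)} - \hat{C}_m^{(t)}\|}{\|\hat{C}_m^{(t)}\|}, \; m = 1, 2, \ldots, k \right\} < \delta_C \tag{15}$$

Condition 2:

$$\min\left\{ \hat{\alpha}_m^{(t)}, \; m = 1, 2, \ldots, k \right\} < \alpha_{\min} \tag{16}$$

Condition 1 checks if consecutive parameter estimates do not differ
significantly; in all the examples below, we set
$\delta_\mu = \delta_C = 0.001$ and use infinity norms $\| \cdot \|_\infty$.
Condition 2 looks for a component whose probability is becoming too small; we
typically use $\alpha_{\min} = 5d/n$. Condition 2 avoids one of the known
problems of EM mentioned earlier (convergence to the boundary of the parameter
space).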
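
The two stopping conditions can be checked as in the sketch below, written for the univariate case where the infinity norm reduces to an absolute value. It assumes that Condition 1 compares relative changes between consecutive estimates; the function names are hypothetical and the small `tiny` guard against division by zero is our addition, not part of the paper.

```python
def relative_change(new, old, tiny=1e-12):
    """Relative change |new - old| / |old| (guarded when old is ~0)."""
    return abs(new - old) / max(abs(old), tiny)


def em_should_stop(old_params, new_params, n, d=1, delta_mu=1e-3, delta_c=1e-3):
    """Return True if Condition 1 or Condition 2 holds.

    old_params, new_params: tuples (alphas, means, variances) from two
    consecutive EM iterations of a univariate mixture.
    """
    _, old_means, old_vars = old_params
    new_alphas, new_means, new_vars = new_params

    # Condition 1: means and variances have (almost) stopped changing.
    mean_change = max(relative_change(a, b) for a, b in zip(new_means, old_means))
    var_change = max(relative_change(a, b) for a, b in zip(new_vars, old_vars))
    condition_1 = mean_change < delta_mu and var_change < delta_c

    # Condition 2: some component is becoming too improbable.
    condition_2 = min(new_alphas) < 5.0 * d / n
    return condition_1 or condition_2
```

When Condition 2 triggers, the vanishing component is the natural candidate for merging in the next step of the algorithm.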



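4.3 Obtaining the (k − 1)-Component Mixture.

The $(k - 1)$-component mixture is obtained by merging two components of the
$k$-component one. We start by locating the pair of mixture components, say
$m_1$ and $m_2$, that are closest to each other and, simultaneously, least
probable. Specifically, we choose $m_1$ and $m_2$ as

$$(m_1, m_2) = \arg\min_{(i,j)} \left\{ (\hat{\alpha}_i + \hat{\alpha}_j) \, D_s\left[f_Y(y \mid \hat{\theta}_i), f_Y(y \mid \hat{\theta}_j)\right], \; i \neq j \right\}, \tag{17}$$

where $D_s[f_Y(y \mid \theta_i), f_Y(y \mid \theta_j)]$ is the symmetric
Kullback-Leibler (KL) divergence, the standard dissimilarity measure between
probability densities [12]. The Jensen-Shannon divergence (see [13]) would be a
natural candidate, because it allows weighting differently the two probability
functions being compared; however, it does not have a closed-form expression
for Gaussian densities and so we settled for the KL divergence. In the
Gaussian case, the symmetric KL divergence is [12]:

$$D_s\left[\mathcal{N}(y \mid \mu_i, C_i), \mathcal{N}(y \mid \mu_j, C_j)\right] = \frac{1}{2} \mathrm{tr}\left[(C_i - C_j)(C_j^{-1} - C_i^{-1})\right] + \frac{1}{2} (\mu_i - \mu_j)^T \left[C_i^{-1} + C_j^{-1}\right] (\mu_i - \mu_j).$$

If EM was stopped by Condition 2 (Eq. (16)), we force $m_1$ to be the component
responsible for making it true. We then choose $m_2$ by Eq. (17), fixing
$i = m_1$.

Consider now the sub-mixture
$\alpha'_{m_1} f_Y(y \mid \theta_{m_1}) + \alpha'_{m_2} f_Y(y \mid \theta_{m_2})$,
where $\alpha'_{m_1} = \alpha_{m_1} / (\alpha_{m_1} + \alpha_{m_2})$ and
$\alpha'_{m_2} = 1 - \alpha'_{m_1}$. Merging the two components of this
sub-mixture is equivalent to finding the parameters $\mu^o$ and $C^o$ of the
“closest” Gaussian density. If “closeness” is taken in the KL sense, then

$$(\mu^o, C^o) = \arg\min_{\mu, C} D_{KL}\left[\alpha'_{m_1} f_Y(y \mid \theta_{m_1}) + \alpha'_{m_2} f_Y(y \mid \theta_{m_2}) \,\|\, \mathcal{N}(y \mid \mu, C)\right],$$

which has a simple solution (see [34], Chap. 12): $\mu^o$ and $C^o$ are the
global mean and covariance of the given two-component mixture, i.e.,

$$\mu^o = \alpha'_{m_1} \mu_{m_1} + \alpha'_{m_2} \mu_{m_2}, \tag{18}$$

$$C^o = \alpha'_{m_1} (C_{m_1} + \mu_{m_1} \mu_{m_1}^T) + \alpha'_{m_2} (C_{m_2} + \mu_{m_2} \mu_{m_2}^T) - \mu^o \mu^{oT}. \tag{19}$$

This means that when merging components $m_1$ and $m_2$ of the mixture, the
resulting component must retain the combined probability, mean, and
covariance. Assume, without loss of generality, that $m_2 = k$, which can
always be achieved by reordering the components. Merging component $m_1 < k$
and $m_2 = k$ of the $k$-component mixture given by
$\{\alpha_m, \mu_m, C_m, \; m = 1, \ldots, k\}$ then yields a
$(k - 1)$-component mixture defined by
$\{\alpha^*_m, \mu^*_m, C^*_m, \; m = 1, \ldots, k - 1\}$, where

$$\alpha^*_m = \begin{cases} \alpha_m, & m \neq m_1 \\ \alpha_{m_1} + \alpha_{m_2}, & m = m_1, \end{cases}$$

$$\mu^*_m = \begin{cases} \mu_m, & m \neq m_1 \\ \dfrac{\alpha_{m_1} \mu_{m_1} + \alpha_{m_2} \mu_{m_2}}{\alpha_{m_1} + \alpha_{m_2}}, & m = m_1, \end{cases}$$

$$C^*_m = \begin{cases} C_m, & m \neq m_1 \\ \dfrac{\alpha_{m_1} (C_{m_1} + \mu_{m_1} \mu_{m_1}^T) + \alpha_{m_2} (C_{m_2} + \mu_{m_2} \mu_{m_2}^T)}{\alpha_{m_1} + \alpha_{m_2}} - \mu^*_{m_1} \mu^{*T}_{m_1}, & m = m_1. \end{cases}$$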
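
For univariate components the pair selection and the moment-preserving merge reduce to a few lines. The sketch below is an illustration under our own assumptions: the symmetric KL divergence is taken as the sum of the two directed divergences between Gaussians, components are stored in plain lists, and the function names are hypothetical. The special case in which EM was stopped by Condition 2 (forcing $m_1$) is not handled.

```python
def symmetric_kl(mean_i, var_i, mean_j, var_j):
    """Symmetric KL divergence between two univariate Gaussians."""
    cov_term = 0.5 * (var_i - var_j) * (1.0 / var_j - 1.0 / var_i)
    mean_term = 0.5 * (mean_i - mean_j) ** 2 * (1.0 / var_i + 1.0 / var_j)
    return cov_term + mean_term


def closest_pair(alphas, means, variances):
    """Pair (i, j), i < j, minimizing (alpha_i + alpha_j) * symmetric KL."""
    best_pair, best_score = None, float("inf")
    k = len(alphas)
    for i in range(k):
        for j in range(i + 1, k):
            score = (alphas[i] + alphas[j]) * symmetric_kl(
                means[i], variances[i], means[j], variances[j])
            if score < best_score:
                best_pair, best_score = (i, j), score
    return best_pair


def merge_components(alphas, means, variances, m1, m2):
    """Merge components m1 and m2, preserving probability, mean and variance."""
    a1, a2 = alphas[m1], alphas[m2]
    total = a1 + a2
    mean = (a1 * means[m1] + a2 * means[m2]) / total
    second_moment = (a1 * (variances[m1] + means[m1] ** 2)
                     + a2 * (variances[m2] + means[m2] ** 2)) / total
    var = second_moment - mean ** 2
    keep = [m for m in range(len(alphas)) if m != m2]  # merged one stays at m1
    new_alphas = [total if m == m1 else alphas[m] for m in keep]
    new_means = [mean if m == m1 else means[m] for m in keep]
    new_vars = [var if m == m1 else variances[m] for m in keep]
    return new_alphas, new_means, new_vars
```

The merged mixture has the same overall mean and variance as the original one, so it is a reasonable starting point for the next EM run.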

5 Experimental Results 

This section is divided into two parts: the first one basically illustrates the work- 
ing of the AEM algorithm showing how it evolves from a redundant mixture to 
successively lower order ones, and how this avoids the need for careful initializa- 
tion. The second part focuses on MMDL by presenting examples (with synthetic 
and real data) where it overcomes the under-fitting tendency of MDL/BIC. 

5.1 The AEM Algorithm 

The first example uses 1000 samples from a mixture of 3 univariate Gaussians 
with means $\mu_1 = \mu_2 = 0$ and $\mu_3 = 6$, and standard deviations
$\sigma_1 = 1$, $\sigma_2 = \sqrt{6}$, and $\sigma_3 = 1$; mixing probabilities
are $\alpha_1 = 0.3$, $\alpha_2 = 0.4$, and $\alpha_3 = 0.3$. Fig. 1
shows AEM evolving from an 8-component mixture (after starting at $k_{\max} = 12$)
to just two components. Observe the above mentioned multi-scale flavor of the 
method in the evolution from more erratic density estimates to smoother ones. 



Fig. 1. Mixture estimates for k = 8, 5, 3 (the true value), and 2, obtained by
AEM. Thin lines show the component densities multiplied by the corresponding 
probabilities, while the thick line plots the resulting mixture. The gray bars 
represent a (normalized) histogram of the observations. 



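The MMDL estimates are $\hat{k} = 3$, $\hat{\mu}_1 = 0.09$,
$\hat{\mu}_2 = 0.11$, $\hat{\mu}_3 = 5.97$, $\hat{\sigma}_1 = \sqrt{0.87}$,
$\hat{\sigma}_2 = \sqrt{6.12}$, $\hat{\sigma}_3 = \sqrt{1.11}$,
$\hat{\alpha}_1 = 0.32$, $\hat{\alpha}_2 = 0.38$, and $\hat{\alpha}_3 = 0.30$.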

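For the next example, 1500 samples were drawn from a mixture of 3 bivariate
Gaussians with $\alpha_1 = \alpha_3 = 0.3$, $\alpha_2 = 0.4$,
$\mu_1 = \mu_2 = [-4, -4]^T$, $\mu_3 = [3, 3]^T$,

$$C_1 = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}, \quad C_2 = \begin{bmatrix} 6 & -2 \\ -2 & 6 \end{bmatrix}.$$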

Fig. 2 shows the algorithm evolving from its initialization (a set of
$k_{\max} = 9$ similar and uniformly spread Gaussians) to the correct
3-component mixture. Notice how different the initial mixture is from the true
one and how AEM was able to overcome this poor initialization. The final
parameter estimates are $\hat{\mu}_1 = [-4.03, -4.12]^T$,
$\hat{\mu}_2 = [-4.01, -3.90]^T$, $\hat{\mu}_3 = [3.08, 2.91]^T$,

$$\hat{C}_1 = \begin{bmatrix} 1.07 & 0.56 \\ 0.56 & 0.88 \end{bmatrix}, \quad \hat{C}_2 = \begin{bmatrix} 5.4 & -1.89 \\ -1.89 & 6.12 \end{bmatrix}, \quad \text{and} \quad \hat{C}_3 = \begin{bmatrix} 2.10 & -1.14 \\ -1.14 & 2.17 \end{bmatrix}.$$

Fig. 2. Initialization and sequence of mixture estimates for k = 9, 7, 6, 4, and 3
(the ellipses are isodensity curves of each component).

Finally, we study the well-known IRIS data set (available, e.g., at
http://www.ics.uci.edu/pub/machine-learning-databases/), which consists of 50
(4-dimensional) samples of each of the three classes present: Versicolor, Virginica,
and Setosa. Starting with $k_{\max} = 8$, both MMDL and MDL/BIC correctly
selected $k = 3$. Using the corresponding parameter estimates to build a maximum
a posteriori classifier according to

$$\hat{m}(y^{(i)}) = \arg\max_m \left\{ \hat{\alpha}_m \, \mathcal{N}(y^{(i)} \mid \hat{\theta}_m) \right\},$$

we find that only two samples get misclassified (one Versicolor is classified as 
Virginica and one Virginica as Versicolor). This is even a little better than the 
three errors reported in [25]; more importantly, it is obtained without multiple 
random starts of EM. 
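
The maximum a posteriori rule above can be sketched in a few lines. This is only an illustrative, univariate version (the IRIS experiment is 4-dimensional), and the parameter values below are made up for the example rather than taken from the paper; `gaussian_pdf` and `map_classify` are hypothetical helper names.

```python
import math

def gaussian_pdf(y, mean, std):
    """Univariate Gaussian density N(y | mean, std^2)."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def map_classify(y, weights, means, stds):
    """Return the index m that maximizes weights[m] * N(y | means[m], stds[m]^2)."""
    scores = [a * gaussian_pdf(y, mu, s) for a, mu, s in zip(weights, means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)

# Illustrative (made-up) estimates for a 3-component mixture.
weights = [0.3, 0.4, 0.3]
means = [-4.0, 0.0, 5.0]
stds = [1.0, 1.5, 1.0]
labels = [map_classify(y, weights, means, stds) for y in (-3.7, 0.4, 6.1)]
```

Each observation is assigned to the component with the largest weighted density, so no separate training step is needed once the mixture has been fitted.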



5.2 Comparing MMDL versus MDL/BIC 

Univariate Data. We start by considering two real univariate data sets for
which MMDL and MDL/BIC yield different estimates of the number of Gaussian
components: the Old Faithful geyser eruption durations (well known in the
density estimation literature [28]), and the enzyme activity data from [21].
Table 1 reports the values of $\mathcal{C}_{\mathrm{MMDL}}(k)$ and $\mathcal{C}_{\mathrm{MDL/BIC}}(k)$ for several values of k for
these two data sets. Fig. 3 shows the resulting mixture density estimates. For
the Old Faithful data, MMDL allows an extra component ($\hat{k}_{\mathrm{MMDL}} = 4$) with
which the resulting mixture adjusts better to the skewness of the right portion
of the histogram. For the enzyme data, the additional component in the mixture
selected by MMDL yields a clearly better fit to the observed histogram. Of
course, in these real-data cases, there is no underlying true mixture, so there
is no way to tell what the correct number of components is, and we must rely
on visual evaluation. An alternative would be to perform a (leave-one-out type)
cross-validation study comparing the MDL/BIC and MMDL criteria.

Old Faithful
k            1       2       3       4       5
MMDL       429.8   288.8   283.6   282.2   286.7
MDL/BIC    429.8   293.2   289.4   291.1   287.8

Enzyme
k            1       2       3       4       5
MMDL       236.3    66.9    65.8    67.4    73.9
MDL/BIC    236.3    71.2    72.5    77.4    87.5

Table 1. MMDL and MDL/BIC cost function values for several values of k for
the Old Faithful and enzyme data sets.
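
As a small illustration of how the model order follows from such cost values, the sketch below picks the k with the smallest cost for the enzyme rows of Table 1. The helper name `select_k` is hypothetical; the numbers are copied from the table.

```python
# Cost values for the enzyme data, copied from Table 1 (k = 1..5).
enzyme_costs = {
    "MMDL":    [236.3, 66.9, 65.8, 67.4, 73.9],
    "MDL/BIC": [236.3, 71.2, 72.5, 77.4, 87.5],
}

def select_k(costs):
    """Return the number of components (1-based) with the smallest cost."""
    return min(range(len(costs)), key=costs.__getitem__) + 1

selected = {name: select_k(costs) for name, costs in enzyme_costs.items()}
```

With these values MMDL selects one component more than MDL/BIC, which matches the discussion of the enzyme data in the text.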






Fig. 3. Mixture estimates produced by the MMDL and MDL/BIC criteria for 
the Old Faithful (top row) and the enzyme data (bottom row). 

Multivariate Data. To test the MMDL criterion on multivariate data, we have
considered a mixture with 8 components on a 3D sample space. The component
means are located at the vertices of a cube of side $\Delta$,

$$\mu_1 = [0, 0, 0]^T, \quad \mu_2 = [\Delta, 0, 0]^T, \quad \mu_3 = [0, \Delta, 0]^T, \quad \ldots, \quad \mu_7 = [0, \Delta, \Delta]^T, \quad \mu_8 = [\Delta, \Delta, \Delta]^T,$$

and all have unit covariance matrix $C_i = \mathrm{diag}\{1, 1, 1\}$, for $i = 1, 2, \ldots, 8$. We
obtained 50 sets of 1200 samples each, for three different separations among
the mixture components: $\Delta = 3$, 3.5, and 4. Figure 4 shows, for these three
values of $\Delta$, the number of times that each value of k was chosen by MMDL and
MDL/BIC. Notice how the performance of MDL/BIC degrades faster than that
of MMDL. For this test, since the goal is to study the behavior of the MMDL
and MDL/BIC criteria, not of the AEM algorithm, we have used $k_{\max} = 8$ and
the true parameters as initialization.
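
A minimal sketch of how one such synthetic sample set could be generated is given below. The text does not state the mixing probabilities, so equal weights are assumed here; `cube_means` and `sample_cube_mixture` are hypothetical names.

```python
import itertools
import random

def cube_means(delta):
    """Eight component means at the vertices of a cube of side delta."""
    return [tuple(delta * b for b in bits)
            for bits in itertools.product((0, 1), repeat=3)]

def sample_cube_mixture(n, delta, rng):
    """Draw n 3D samples from the mixture (equal weights are an assumption)."""
    means = cube_means(delta)
    samples = []
    for _ in range(n):
        mean = rng.choice(means)                      # pick a component
        samples.append(tuple(rng.gauss(m, 1.0) for m in mean))  # unit covariance
    return samples

rng = random.Random(0)
data = sample_cube_mixture(1200, 3.5, rng)
```

Repeating the call 50 times per value of the cube side would reproduce the size of the experiment described above.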






Fig. 4. Top row: histograms of the estimates of k (true value is 8) for $\Delta$ = 4,
3.5, 3. Bottom row: examples of the first two components of the sample sets.



The MMDL criterion was also used in [33], for a Bayesian image classification
problem. The class-conditional densities are represented by Gaussian mixtures,
learned via a vector-quantization (VQ) approach, with the MMDL criterion
controlling the size of each VQ. Given the very high dimensionality of the feature
space (> 100), $N(1)$ is very high and MDL/BIC always yielded uselessly small
estimates of k. With the estimates provided by MMDL, the resulting Bayesian
classifier exhibited very good performance.

6 Conclusions and Further Work 

We have proposed a new criterion to select the number of components in Gaussian
mixtures and a new algorithm especially suited for mixture model estimation
with an unknown number of components. The new criterion, called mixture MDL 
(MMDL), is a simple modification of the standard MDL/BIC, resulting from the 
identification of what can be called the equivalent sample size for each compo- 
nent. The proposed algorithm is based on EM together with an agglomerative 
step, thus it is called agglomerative EM (AEM). We have presented examples 
illustrating the behavior of AEM and its robustness with respect to initialization 
(although a more complete set of tests is still required). To compare MMDL ver- 
sus MDL/BIC, we have performed experiments on real and synthetic data. All 
the experiments confirm that MMDL allows a better fit to the observed data. 

Finally, we mention the parameterization of the covariance matrices (based
on eigen-decomposition), introduced in [1] (see also [4]). That parameterization 
allows taking selected characteristics of the components to be common (for ex- 
ample, same shape, arbitrary orientation). MMDL can also be used to perform 
model selection among the options provided by that approach. The goal is to si- 
multaneously choose the number of components and decide which characteristics 
(if any) should be assumed common. 


Abstract. In this paper, we propose two Bayesian methods for detecting
and grouping junctions. Our junction detection method evolves from
the Kona approach, and it is based on a competitive greedy procedure
inspired by the region competition method. Then, junction grouping is
accomplished by finding connecting paths between pairs of junctions.
Path searching is performed by applying a Bayesian A* algorithm that
has been recently proposed. Both methods are efficient and robust, and
they are tested with synthetic and real images.



1 Introduction 

As is well known, junctions are the atoms of more complex processes or tasks
(depth estimation, matching, segmentation, and so on) because these features
provide useful local information about geometric properties and occlusions.
Hence, methods for extracting these low-level features from real-world images
must be efficient and reliable. Moreover, the relation between junctions and
specific tasks must be investigated. In this context, mid-level representations
that encode spatial relations between junctions may be useful to reduce the
complexity of these tasks. In this paper we propose two Bayesian methods to detect
and group junctions along connecting edges.

Previous works on junction extraction can be classified as edge-grouping
methods, template-matching methods, and mixed strategies. Grouping algorithms
use gradient information to build junctions (e.g. [6] and [13]), whereas
template-based methods (e.g. [5] and [22]) are based on local filters. Mixed
approaches are based on a local filter followed by template fitting. Two examples
of these methods are [20] and [17]. In the latter approach, whose implementation
is called Kona, junctions are modeled as piecewise constant regions (wedges)
emanating from a central point. Junction detection is performed in two steps:
center extraction, based on a local operator, and radial partition detection,
based on a template deformation framework that uses the minimum description
length (MDL) principle [19]. However, the proposed strategy for finding wedges
(dynamic programming) may be too slow for real-time purposes, and the
robustness of the method may be improved. In the first part of this paper we
propose a junction detector that evolves from Kona, and therefore it also pays
attention to the MDL principle. As in Kona, we first perform corner detection by
using a robust filter. Then, we find the optimal number of wedges and also their
image properties and location. In this case our strategy is based on the local
competition of wedges, and the search can be done with a simple greedy procedure.
This strategy is inspired by the region competition method [26] recently developed
for image segmentation. Our method is fast and reliable. Robustness is
provided by the use of sound statistics.

The use of junctions in segmentation, matching, and recognition, is the sub- 
ject of several recent works. In [11] junctions are used as breaking points to 
locate and classify edges as straight or curved. Junctions are used as stereo cues 
in [15]. In [16] junctions are used as fixation cues. In this work, fixation is driven 
by a grouping strategy which forms groups of connected junctions separated 
from the background at depth discontinuities. The role of corners in recognition 
appears in [12], where a mixed bottom-up/top-down strategy is used to combine
information derived from corners and the results of contour segmentation.
Finally, junctions are used in [9] to constrain the grey level of image regions in 
segmentation. In the second part of this paper we propose a method to connect 
junctions along edges. This method is based on recent results on edge tracking 
using non-linear filters under a statistical framework [7], [24], [2], and [25]. 

The rest of the paper is organized as follows: In section 2, we present our 
junction detector and some experimental results with synthetic and real images. 
The analysis of these results motivates our junction-connecting approach pre- 
sented in section 3. Grouping results are presented at the end of this section. 
Finally we present our conclusions and future work. 

2 Junction Detection and Classification 

2.1 Junction Model: Parameters and Regions 

The relation between the real configurations of junctions and their appearance
is well documented in the literature [23], [14]. A generic junction model can be
encoded by a parametric template $\Theta = (x, y, r, M, \{\theta_i\}, \{T_i\})$, where $(x, y)$ is
the center, $r$ is the radius, $M$ is the number of wedges, $\{\theta_i\}$, with $i = 1, 2, \ldots, M$,
are the limits of the angular sections, and $\{T_i\}$ are the intensity distributions
associated to these angular sections (see Fig 1).

We assume that potential junction centers $(x, y)$ can be localized by a local
filter. Examples of these operators are the Plessey detector for corners [8] and
the filters proposed in [10] and [1]. Here, we use SUSAN, a robust and fast
non-linear filter that has been recently proposed [21]. SUSAN estimates the
intensity homogeneity inside a circular domain (the SUSAN measure). Corners have
low homogeneity. In consequence they can be detected, provided that we use a
good threshold. This principle can also be applied to find edges in the image.

M.A. Cazorla et al.

In order to avoid distortions near the junction center, we also discard a small
circular domain centered at $(x, y)$ with radius $R_{\min}$, as suggested by Parida et
al. Then, $r = R_{\max} - R_{\min}$, where $R_{\max}$ is the scope of the junction. Moreover,
although Kona provides a method for estimating the optimal value of $r$ around a
given center, its cost is prohibitive for real-time purposes. We therefore assume that
$r$ can be estimated by the user.

We also consider that a junction is defined by several regions of homogeneous
intensity around the circular domain defined by $(x, y)$ and $r$. Then, the problem
of finding $M$, $\{\theta_i\}$ and $\{T_i\}$ can be solved by analyzing the piecewise constant
function associated to the junction. We compute a one-dimensional intensity
profile by estimating, for each angle $\theta \in [0, 2\pi]$, the averaged accumulated
intensity $\tilde{I}_\theta$ along the radius $r$. An example of a circular domain and its associated
profile is shown in Fig 1.
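
The construction of the intensity profile can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a grayscale image stored as a list of rows, nearest-pixel sampling along each ray, and a hypothetical function name `intensity_profile`.

```python
import math

def intensity_profile(image, cx, cy, r_min, r_max, n_angles=360):
    """Average the intensity along each ray from r_min to r_max around (cx, cy)."""
    height, width = len(image), len(image[0])
    profile = []
    for k in range(n_angles):
        theta = 2.0 * math.pi * k / n_angles
        total, count = 0.0, 0
        for rho in range(r_min, r_max + 1):
            # Nearest-pixel sampling (an assumption; interpolation is also possible).
            x = int(round(cx + rho * math.cos(theta)))
            y = int(round(cy + rho * math.sin(theta)))
            if 0 <= x < width and 0 <= y < height:
                total += image[y][x]
                count += 1
        profile.append(total / count if count else 0.0)
    return profile

# Synthetic two-wedge junction: left half dark (50), right half bright (200).
image = [[50 if x < 10 else 200 for x in range(21)] for y in range(21)]
profile = intensity_profile(image, cx=10, cy=10, r_min=2, r_max=5, n_angles=8)
```

For this synthetic image the profile is piecewise constant, with one level per wedge, which is the structure that the segmentation step exploits.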




Fig. 1. Top left: Parametric model. Top right: Discrete accumulation of intensity 
along a direction. Bottom left: Example of a junction without noise. Bottom 
right: Intensity profile of the junction. 

Angular sections are mapped to intervals $S_i = \{\theta \mid \theta \in [\theta_{S_i}, \theta_{S_{i+1}}]\}$ in the
profile. Then, we can formulate the problem of junction detection as the
segmentation of the profile into homogeneous intervals. An interval $S_i$ is considered
homogeneous when its intensity values are consistent with a given probability
distribution $T_i$. Here, we assume a Gaussian model, so that $T_i = P(\tilde{I}_\theta \mid \mu_i, \sigma_i)$,
where $\mu_i$ is the mean and $\sigma_i^2$ is the variance. This model allows us to cope with
noise in real images.

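Applying the Region Competition framework [26] to the intensity profile, the
segmentation task consists of minimizing the following energy function:

$$\sum_{i=1}^{M} \left\{ -\log P(\{\tilde{I}_\theta : \theta \in S_i\} \mid \mu_i, \sigma_i) \right\} \qquad (1)$$

where $M$ is the number of angular sections and $-\log P(\{\tilde{I}_\theta : \theta \in S_i\} \mid \mu_i, \sigma_i)$ is the sum
of the cost of coding each value $\tilde{I}_\theta$ within the interval $S_i$ according to a distribution
$P(\tilde{I}_\theta \mid \mu_i, \sigma_i)$. Assuming independent probability models for each interval,
we have:

$$\log P(\{\tilde{I}_\theta : \theta \in S_i\} \mid \mu_i, \sigma_i) = \int_{S_i} \log P(\tilde{I}_\theta \mid \mu_i, \sigma_i)\, d\theta \qquad (2)$$

and the global energy function is (if $i = M$ then $S_{i+1} = S_1$):

$$E_J(\Theta, \{\mu_i, \sigma_i\}) = \sum_{i=1}^{M} - \int_{\theta_{S_i}}^{\theta_{S_{i+1}}} \log P(\tilde{I}_\theta \mid \mu_i, \sigma_i)\, d\theta \qquad (3)$$

where $\theta_{S_i}$ are the limits of the interval. As this function depends on $M$, this
criterion pays attention to the MDL principle.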



2.2 Wedge Identification with Competitive Descent 

Wedge identification is performed by a greedy algorithm that minimizes
$E_J(\Theta, \{\mu_i, \sigma_i\})$. Beforehand, we compute initial guesses for each interval. These
guesses are given by the SUSAN edge detector applied to the intensity profile.
In consequence, the initial number of wedges is always greater than or equal to
the optimal one, so we will eventually need to merge regions.

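Then, we calculate the values $(\mu_i, \sigma_i)$ of each interval, and perform gradient
descent with respect to the limits $\theta_{S_i}$. For each limit $\theta_{S_j}$ we have:

$$
\begin{aligned}
\frac{d\theta_{S_j}}{dt} &= \frac{\partial}{\partial \theta_{S_j}} \left\{ \int_{\theta_{S_{j \ominus 1}}}^{\theta_{S_j}} \log P(\tilde{I}_\theta \mid \mu_{j \ominus 1}, \sigma_{j \ominus 1})\, d\theta + \int_{\theta_{S_j}}^{\theta_{S_{j \oplus 1}}} \log P(\tilde{I}_\theta \mid \mu_j, \sigma_j)\, d\theta \right\} \\
&= \frac{\partial}{\partial \theta_{S_j}} \left\{ \left[ F_{j \ominus 1}(\theta) \right]_{\theta_{S_{j \ominus 1}}}^{\theta_{S_j}} + \left[ F_j(\theta) \right]_{\theta_{S_j}}^{\theta_{S_{j \oplus 1}}} \right\} \\
&= \log P(\tilde{I}_{\theta_{S_j}} \mid \mu_{j \ominus 1}, \sigma_{j \ominus 1}) - \log P(\tilde{I}_{\theta_{S_j}} \mid \mu_j, \sigma_j) = \log \frac{P(\tilde{I}_{\theta_{S_j}} \mid \mu_{j \ominus 1}, \sigma_{j \ominus 1})}{P(\tilde{I}_{\theta_{S_j}} \mid \mu_j, \sigma_j)} \qquad (4)
\end{aligned}
$$

where $F(\cdot)$ is the primitive function of $\log P(\tilde{I}_\theta \mid \mu, \sigma)$, and $j \ominus 1$ and $j \oplus 1$ denote
the cyclically preceding and following intervals. This ratio determines the
change of $\theta_{S_j}$ and it is equivalent to the classification rule for two categories [4].
If $P(\tilde{I}_{\theta_{S_j}} \mid \mu_j, \sigma_j) > P(\tilde{I}_{\theta_{S_j}} \mid \mu_{j \ominus 1}, \sigma_{j \ominus 1})$, i.e., if $\tilde{I}_{\theta_{S_j}}$ fits better to the distribution
of the angular section $S_j$ than to the distribution of $S_{j \ominus 1}$, then the value of the
limit $\theta_{S_j}$ is decreased. Otherwise, it is increased.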

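Considering our Gaussian assumption we have:

$$P(\tilde{I}_\theta \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(\tilde{I}_\theta - \mu)^2}{2\sigma^2} \right\} \qquad (5)$$

and replacing (5) in (4):

$$\frac{d\theta_{S_j}}{dt} = \log \frac{\sigma_j}{\sigma_{j \ominus 1}} - \frac{1}{2} \left\{ \frac{(\tilde{I}_{\theta_{S_j}} - \mu_{j \ominus 1})^2}{\sigma_{j \ominus 1}^2} - \frac{(\tilde{I}_{\theta_{S_j}} - \mu_j)^2}{\sigma_j^2} \right\} \qquad (6)$$

In order to provide robustness we use the median value and the trimmed variance,
instead of considering the mean and the variance.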
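
The update of a single limit can be sketched as below: the sign of the log-likelihood ratio at the limit decides whether it moves forward or backward. The trimming fraction, the variance floor, and the function names (`robust_stats`, `limit_velocity`) are assumptions made for this illustration and do not come from the paper.

```python
import math
import statistics

def robust_stats(values, trim=0.1):
    """Median and trimmed variance (the trimming fraction is an assumption)."""
    ordered = sorted(values)
    cut = int(len(ordered) * trim)
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    center = statistics.median(ordered)
    var = sum((v - center) ** 2 for v in kept) / len(kept)
    return center, max(var, 1e-6)  # floor keeps the log-ratio finite

def gaussian_log_pdf(value, mean, var):
    """Log of the Gaussian density with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)

def limit_velocity(value_at_limit, prev_stats, next_stats):
    """Log-likelihood ratio: preceding interval versus following interval."""
    return (gaussian_log_pdf(value_at_limit, *prev_stats)
            - gaussian_log_pdf(value_at_limit, *next_stats))

prev_stats = robust_stats([50, 52, 49, 51, 50, 48])
next_stats = robust_stats([200, 198, 203, 201, 199, 202])
v_dark = limit_velocity(51, prev_stats, next_stats)     # resembles the preceding interval
v_bright = limit_velocity(199, prev_stats, next_stats)  # resembles the following interval
```

A positive velocity pushes the limit forward, so the preceding interval grows; a negative one moves it backward, so the following interval grows.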

Interval merging is based on a statistical test. Instead of applying the Fisher
distance as is done in the Region Competition method, our algorithm merges
two adjacent intervals when the difference between their medians is below a given
threshold $D$ (typically $D$ = 10-15 units). Additional merging occurs when the
length of the interval is below $\pi/9$. Moreover, false junctions with $M = 2$ are
removed when their angle is close to $\pi$.
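
The two merging rules can be illustrated with the sketch below. It is a simplification under stated assumptions: intervals are given as `(start, end, samples)` tuples, a short interval is merged into its predecessor (the text does not say which neighbour is chosen), and the wrap-around between the last and first interval is omitted.

```python
import math
import statistics

def merge_intervals(intervals, d_threshold=10.0, min_length=math.pi / 9):
    """Greedy merge of adjacent (start, end, samples) intervals of a profile."""
    merged = [intervals[0]]
    for start, end, samples in intervals[1:]:
        p_start, p_end, p_samples = merged[-1]
        # Rule 1: medians closer than the threshold.
        close = abs(statistics.median(samples) - statistics.median(p_samples)) < d_threshold
        # Rule 2: interval shorter than the minimum angular length.
        short = (end - start) < min_length
        if close or short:
            merged[-1] = (p_start, end, p_samples + samples)
        else:
            merged.append((start, end, samples))
    return merged

intervals = [
    (0.0, 1.0, [50, 51, 49]),
    (1.0, 2.0, [55, 54, 56]),     # median differs by 5 < 10: merged
    (2.0, 4.0, [200, 199, 201]),
    (4.0, 4.2, [120, 121]),       # shorter than pi/9: merged
    (4.2, 6.28, [90, 91, 89]),
]
result = merge_intervals(intervals)
```

In this example five initial intervals collapse into three wedges.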



2.3 Junctions: Detection Results 

We have selected the following values for our experiments: $R_{\min} = 2$,
$R_{\max} = 5$, and $D = 10$. Under these conditions we obtain an average error of 9
degrees. The processing time was 0.5 seconds on a Pentium II 233MHz under
Linux.

Our algorithm obtains good results with synthetic images (see Fig 2 (top)).
However, experiments with real images show several problems that deserve more
attention: (a) Our corner detection may generate an imprecise localization of the
center, or the center may not be detected at all. (b) Bad choices of $r$ may generate
several false limits (see Fig 2 (bottom)): when $r$ is high we can invade another region and
generate distortions in the intensity profile. (c) Several false junctions may not
have geometric meaning. These problems can be observed in Fig 2 (bottom), and
also in Fig 3.

These problems motivate the extension of our work to perform junction 
grouping along connecting edges. As we will see in the second part of this paper, 
junction grouping allows us to remove false positives due to locality, to detect 
curved junctions, to localize undetected corners, and to correct poor localization. 

Fig. 2. Top: Results with a synthetic image. Bottom: Example of an image from
a corridor with curved junctions (left), and several zoomed areas from the same
image (right).

Fig. 3. More results with real images. Most of the X-junctions have false wedges.

3 Connecting and Filtering Junctions



3.1 Path Modeling and Edge Tracking 

Now we are interested in finding connecting paths, i.e. paths that connect pairs
of junctions along the edges between them, provided that these edges exist. More
precisely, a connecting path $P$ of length $N$, rooted on a junction center $(x, y)$ and
oriented with the angle $\theta$ between two angular sections, is defined by a collection
of connected segments $p_1, p_2, \ldots, p_N$ with fixed or variable length. We assume
that the curvature of these paths must be smooth, so we also define orientation
variables $\alpha_1, \alpha_2, \ldots, \alpha_{N-1}$, where $\alpha_j$ is the angle between two consecutive
segments $p_j$ and $p_{j+1}$. Following the Bayesian approach of Yuille and Coughlan [25],
the optimal path maximizes

$$\sum_{j=1}^{N} \log \left\{ \frac{P_{on}(p_j)}{P_{off}(p_j)} \right\} + \sum_{j=1}^{N-1} \log \left\{ \frac{P_{\Delta G}(\alpha_{j+1} - \alpha_j)}{U(\alpha_{j+1} - \alpha_j)} \right\} \qquad (7)$$



The first term of this function is the intensity reward, and it depends on the
edge strength along each segment $p_j$. Edge strength is modeled by a probability
distribution $P(f(p_j))$ of the responses of a non-linear filter $f(p_j) = \phi(|\nabla I(p_j)|)$,
where $|\nabla I(p_j)|$ is the magnitude of the gradient in the neighbourhood of the
segment. As the distribution of these responses depends on the relative position
between a segment $p_j$ and the edge, $P(f(p_j))$ is defined in the following terms:

$$P(f(p_j)) = \begin{cases} P_{on}(f(p_j)) & \text{if } p_j \text{ lies on the edge,} \\ P_{off}(f(p_j)) & \text{otherwise,} \end{cases} \qquad (8)$$

where $P_{on}(f(p_j))$ and $P_{off}(f(p_j))$ are the probability distributions of the
responses of segments lying on and off the path. These distributions are obtained
by gathering statistics of the responses of the filter when a segment is placed on
and off the edges given by the gradient operator.

The second term is the geometric reward: $P_G(\alpha_{j+1} \mid \alpha_j) = P_{\Delta G}(\alpha_{j+1} - \alpha_j)$
models a first order Markov chain on the orientation variables $\alpha_j$. Curvature
smoothing is provided by a negative exponential density function

$$P_{\Delta G}(\Delta\alpha_j) \propto \exp\left\{ -\frac{C}{2A} |\Delta\alpha_j| \right\} \qquad (9)$$

where $\Delta\alpha_j = \alpha_{j+1} - \alpha_j$, $A$ is the maximum angle between two consecutive
segments, and $C$ modulates the rigidity of the path. Additionally, $U(\alpha_{j+1} - \alpha_j)$
is the uniform distribution of the angular variation, and it is included to keep
both the geometric and the intensity terms in the same range.
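
The reward of a candidate path can be sketched as a sum of log-likelihood ratios. Everything specific in this example is an assumption made for illustration: the filter responses are quantized into three bins with made-up histograms, the turn angles come from a small discrete set, and the exponent of the geometric density is taken as C|Δα|/(2A).

```python
import math

def geometric_log_ratio(delta_alpha, angles, rigidity, max_angle):
    """log of P_dG(delta_alpha) / U(delta_alpha) over a discrete set of turns."""
    weights = [math.exp(-rigidity * abs(a) / (2.0 * max_angle)) for a in angles]
    p = math.exp(-rigidity * abs(delta_alpha) / (2.0 * max_angle)) / sum(weights)
    u = 1.0 / len(angles)
    return math.log(p / u)

def path_reward(p_on, p_off, responses, turns, angles, rigidity=5.0, max_angle=0.2):
    """Sum of intensity and geometric log-likelihood ratios along a path."""
    intensity = sum(math.log(p_on[r] / p_off[r]) for r in responses)
    geometry = sum(geometric_log_ratio(t, angles, rigidity, max_angle) for t in turns)
    return intensity + geometry

# Hypothetical 3-bin response histograms (bin 2 = strong edge response).
p_on = [0.1, 0.3, 0.6]
p_off = [0.7, 0.2, 0.1]
angles = [-0.2, 0.0, 0.2]
straight = path_reward(p_on, p_off, [2, 2, 2], [0.0, 0.0], angles)
wiggly = path_reward(p_on, p_off, [2, 2, 2], [0.2, -0.2], angles)
off_edge = path_reward(p_on, p_off, [0, 0, 0], [0.0, 0.0], angles)
```

A straight path over strong edge responses scores highest; turning lowers the geometric term, and weak responses lower the intensity term.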

The Yuille and Coughlan approach evolves from the work of Geman and
Jedynak [7] on road tracking, and introduces the analysis of the gradient operator
and the consideration of the geometric term. For the design of our intensity
reward we have used the original filter of Geman and Jedynak, and we have
applied it to the gradient obtained with the SUSAN edge detector. Our geometric
reward is designed as suggested by Yuille and Coughlan. These distributions are
shown in Fig 4.

Fig. 4. Left: Intensity distributions $P_{on}$ and $P_{off}$ for the SUSAN gradient operator,
obtained with 200 samples. Right: Geometric distributions $P_{\Delta G}$, for $C = 5.0$
and $A = 0.2 \approx \pi/15$ radians, and $U$.



3.2 Path Searching and Junction Grouping 

Finding straight or curved connecting paths in cluttered scenes may be a difficult
task, and it must be done in a short time, especially when real-time constraints
are imposed. Coughlan and Yuille [2] have recently proposed a method, called
Bayesian A*, that exploits the statistical knowledge associated to the intensity
and geometric rewards. This method is rooted in a previous theoretical analysis
[24] of the connection between the Twenty Questions algorithm of Geman
and Jedynak and the classical A* algorithm [18].

Given an initial junction center $(x_0, y_0)$ and an orientation $\theta_0$, the algorithm
explores a tree in which each segment $p_j$ can expand $Q$ successors, so there are
$Q^N$ possible paths. The Bayesian A* reduces the conservative breadth-first behaviour
of the classical A* by exploiting the fact that we want to detect a target path
against clutter, instead of finding the best choice from a population of paths. In
consequence there is one true path and a lot of false paths. Then it is possible to
reduce the complexity of the search by pruning partial paths with low rewards.
The algorithm evaluates the averaged intensity and geometric rewards of the last
$N_0$ segments of a path (the segment block) and discards them when one of these
averaged rewards is below a threshold, i.e. when

$$\frac{1}{N_0} \sum_{j=zN_0}^{(z+1)N_0-1} \log \left\{ \frac{P_{on}(p_j)}{P_{off}(p_j)} \right\} < T \quad \text{or} \quad \frac{1}{N_0} \sum_{j=zN_0}^{(z+1)N_0-1} \log \left\{ \frac{P_{\Delta G}(\Delta\alpha_j)}{U(\Delta\alpha_j)} \right\} < \hat{T} \qquad (10)$$



where $T$ and $\hat{T}$ are the intensity and geometric thresholds that modulate the
pruning behaviour of the algorithm. These parameters establish the minimum
averaged reward that a path needs to survive, and in consequence they are closely
related to the probability distributions used to design both the intensity and the
geometric rewards. They must satisfy the following conditions:

$$-D(P_{off} \| P_{on}) < T < D(P_{on} \| P_{off}), \qquad -D(U \| P_{\Delta G}) < \hat{T} < D(P_{\Delta G} \| U) \qquad (11)$$
where $D$ is the Kullback-Leibler divergence. The algorithm finds the best path
that survives the pruning, and the expected convergence rate is $O(N)$. Typically
the values of $T$ and $\hat{T}$ are set close to their higher bounds. Additionally, if
$P_{on}$ diverges from $P_{off}$, the pruning rule will be very restrictive. Conversely, if
these distributions are similar, the algorithm will be very conservative. The same
reasoning follows for $P_{\Delta G}$ and $U$.
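
The threshold bounds and the block pruning test can be illustrated as follows, for the intensity term only. The three-bin histograms are hypothetical and the function names are chosen for this sketch; the geometric term would be handled in the same way with its own pair of distributions.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def threshold_bounds(p_on, p_off):
    """Admissible open interval for the threshold: (-D(off||on), D(on||off))."""
    return -kl_divergence(p_off, p_on), kl_divergence(p_on, p_off)

def prune_block(block_responses, p_on, p_off, threshold):
    """True when the averaged log-likelihood ratio of a block is below the threshold."""
    avg = sum(math.log(p_on[r] / p_off[r]) for r in block_responses) / len(block_responses)
    return avg < threshold

# Hypothetical 3-bin response histograms (bin 2 = strong edge response).
p_on = [0.1, 0.3, 0.6]
p_off = [0.7, 0.2, 0.1]
low, high = threshold_bounds(p_on, p_off)
discard_weak = prune_block([0, 0, 1], p_on, p_off, threshold=0.0)
discard_strong = prune_block([2, 2, 1], p_on, p_off, threshold=0.0)
```

The bounds are the expected block rewards of off-edge and on-edge segments respectively, which is why a threshold between them separates the two populations on average.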

The application of this algorithm in the context of junction grouping motivates
the extension of the basic pruning rule. We also consider the stability of
long paths against shorter paths. Long paths are more likely to be close to
the target than shorter ones, because they have survived more reward prunes.
Then, if $N_{best}$ is the length of the best partial path, we will also prune paths
with lengths $N_j$ when

$$N_{best} - N_j > Z N_0 \qquad (12)$$

where $Z > 0$ sets the minimum allowed difference between the best path and
the rest of the paths. Low values of $Z$ introduce more pruning, and the risk of
losing the true path is higher. When $Z$ is large, shorter paths can survive.
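
The extended rule amounts to a simple length filter, sketched below. The function name `extended_prune` and the example lengths are hypothetical; the default values mirror the parameter names Z and the block size used in the text.

```python
def extended_prune(path_lengths, n_best, z=1.0, block_size=3):
    """Keep the paths whose length is within z * block_size of the best path."""
    return [n for n in path_lengths if n_best - n <= z * block_size]

# The best partial path has 12 segments; paths more than 3 segments shorter are dropped.
survivors = extended_prune([12, 10, 9, 8, 3], n_best=12)
```

Raising `z` widens the tolerated gap, so more of the shorter paths stay in the queue.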

The algorithm selects for expansion the best partial path that survives the
extended pruning rule. These paths are stored in a sorted queue. We consider
that we have reached the end of a connecting path when the center $(x_f, y_f)$
of a junction is found in a small neighbourhood around the end of the selected
path. In order to perform this test, we use a range tree, a representation from
Computational Geometry [3] that is suitable to search efficiently within a range.
The cost of generating the tree is $O(J \log J)$, where $J$ is the number of detected
junctions. Using this representation, a range query can be performed with cost
$O(\log J)$ in the worst case.

Once a new junction is reached, the last segment of the path must lie on the
limit $\theta_f$ between two wedges. Then, we use this condition to label the closest
limit as visited. If the last segment falls between two limits and the angle between 
them is below a given threshold B, then both limits are labeled as visited. As 
the search of a new path can be started only along a non-visited limit, this 
mechanism avoids tracking the same edge in the opposite direction. 

However, the search may finish without finding a junction. This event is
indicated by an empty queue. In this case, if the length of the last path expanded
by the algorithm is below the block size $N_0$, we consider that this path
emanates from a false limit, and this limit is cancelled. Otherwise, the search has
reached a termination point and its coordinates must be stored. If we find
another termination point in a given neighbourhood, both paths are connected.
This connection is associated to a potential undetected junction when the angle
between the last segments of the paths is greater than $\pi/9$, the minimum angle
to declare a junction.

Our local-to-global grouping algorithm starts from a given junction and performs
path searching for each non-visited limit. When a new junction is reached,
its corresponding limit is labeled as visited. Once all paths emanating from a
junction are tested, the algorithm selects a new junction. Connected junctions
are grouped. Labeling avoids path duplication. Robustness is provided by the fact
that an edge can be tracked in a given direction if the search from the opposite
direction fails. As we have seen previously, it is possible to join partial paths at
termination points. False limits are filtered, and it is possible to discover new
junctions, and also to correct the existing ones. False junctions are removed when
all their paths fail. This method generates a mid-level representation. We can use
the connectivity and the information contained in the paths for segmentation,
tracking and recognition tasks.



3.3 Grouping: Connecting Results 

We have tested our grouping algorithm both with the synthetic and the real
images processed by our junction detector. Some results are shown in Fig 5.
We have used the following parameters: branching factor $Q = 3$, with $Q_0 = 5$
at the first step of the algorithm; block size $N_0 = 3$ segments; maximum
angle $A = 0.2 \approx \pi/15$ radians; rigidity $C = 5.0$; divergences and thresholds:
$-D(P_{off} \| P_{on}) = -5.980$, $D(P_{on} \| P_{off}) = 3.198$, $T = 0.0$, $-D(U \| P_{\Delta G}) = -0.752$,
$D(P_{\Delta G} \| U) = 0.535$, $\hat{T} = 0.40$, $B = \pi/6$; and extended pruning
parameter $Z = 1.0$. The approximate processing time is $t = 3.0$ seconds on a
Pentium II 233MHz under Linux.

4 Conclusions and Future Work 

There are two main contributions in this paper: the junction detection method
and the grouping approach. Both methods are inspired by Bayesian techniques
and they are tested with synthetic and real images. Our junction detector is fast,
because it is based on greedy search, and reliable, because we use sound statistics. Our
junction grouping approach finds connecting paths between pairs of junctions
and it is also efficient and robust. This method allows us to filter false junctions
and to discover undetected ones. Then, a mid-level representation is obtained.
Future work includes the refinement of this structure and its use in segmentation
and reconstruction tasks, especially in the context of robot navigation.


Abstract. This paper deals with hierarchical Markov Random Field
models. We propose to introduce new hierarchical models based on a
hybrid structure which combines a spatial grid of a reduced size at the
coarsest level with sub-trees appended below it, down to the finest level.
These models circumvent the algorithmic drawbacks of grid-based models
(computational load and/or great dependence on the initialization)
and the modeling drawbacks of tree-based approaches (cumbersome and
somewhat artificial structure). The hybrid structure leads to algorithms
that mix a non-iterative inference on sub-trees with an iterative deterministic
inference at the top of the structure. Experiments on synthetic
images demonstrate the gains provided in terms of both computational
efficiency and quality of results. Then experiments on real SPOT satellite
images illustrate the ability of hybrid models to perform efficiently the
multispectral image analysis.



1 Background: Hierarchical Energy-Based Models 

Many inverse problems from image analysis can be managed by designing an
energy function $U(x, y)$ which captures the interaction between a large number
of unknown variables $x = (x_i)_i$ to be estimated, and the observed variables
(the measurements or data) $y = (y_j)_j$. The manipulation of this function is
made tractable by its usual decomposition as a sum of local terms involving
just a few variables at a time. This kind of problem is encountered in Markov
random field-based approaches as well as in partial differential equation (PDE)-based
approaches, where, in the first stage, the energy depends on a continuous
function $x$. Within the framework of Markov random fields, $x$ and $y$ are random
vectors and we have the following relation between the joint distribution and
the energy function: $P(x, y) \propto \exp\{-U(x, y)\}$. As far as the inference of $x$ is
concerned, the Bayesian estimation theory provides two standard estimators:
the Maximum A Posteriori (MAP) estimator, which corresponds to the global
minimizer of the energy function ($\hat{x} = \arg\max_x P(x \mid y) = \arg\min_x U(x, y)$), and
the Modes of Posterior Marginals (MPM) estimator ($\forall i$, $\hat{x}_i = \arg\max_{x_i} P(x_i \mid y)$),
which requires the computation of marginals by summation over huge sets of
configurations.
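
For a very small problem both estimators can be computed by brute-force enumeration, which makes the difference between them concrete. The toy energy below (a quadratic data term plus a penalty `beta` for each label change along a chain) is an assumption chosen for illustration and is not the model studied in this paper.

```python
import itertools
import math

def energy(x, y, beta=1.0):
    """Toy chain energy: data term plus a smoothness term (both are assumptions)."""
    data = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    smooth = sum(beta for a, b in zip(x, x[1:]) if a != b)
    return data + smooth

def map_and_mpm(y, labels=(0, 1), beta=1.0):
    """Brute-force MAP and MPM estimates for a small label vector."""
    configs = list(itertools.product(labels, repeat=len(y)))
    weights = [math.exp(-energy(x, y, beta)) for x in configs]
    total = sum(weights)
    # MAP: the single configuration with minimal energy.
    x_map = min(configs, key=lambda x: energy(x, y, beta))
    # MPM: maximize each site's posterior marginal separately.
    x_mpm = []
    for i in range(len(y)):
        marginal = {lab: sum(w for x, w in zip(configs, weights) if x[i] == lab) / total
                    for lab in labels}
        x_mpm.append(max(marginal, key=marginal.get))
    return x_map, tuple(x_mpm)

x_map, x_mpm = map_and_mpm([0.1, 0.2, 0.9, 0.8])
```

The enumeration visits every configuration, so its cost grows exponentially with the number of sites; this is the cost that local iterative schemes and hierarchical structures are meant to avoid.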

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR'99, LNCS 1654, pp. 83-98, 1999.
© Springer-Verlag Berlin Heidelberg 1999

It turns out that for most energy-based models suitable for image analysis
problems, one has to devise deterministic or stochastic iterative algorithms
exploiting the locality of the model in order to conduct the MAP or MPM infer- 
ence. While permitting tractable single-step computations, the locality results 
in a very slow propagation of information. As a consequence, these iterative pro- 
cedures may converge very slowly. This motivates the search either for improved 
algorithms of generic use, or for specific models allowing non-iterative or more 
efficient inference. 

So far, the most fruitful approaches in both cases have relied on some notion
of hierarchy. Hierarchical models or algorithms allow the information to be inte- 
grated in a progressive and efficient way (especially in the case of multiresolution 
data, when images come into a hierarchy of scales) providing gains in terms of 
both computational efficiency and quality of results. 

Algorithm-based hierarchical approaches are usually related to well-known 
multigrid resolution techniques from Numerical Analysis, where an increasing 
sequence of nested spaces is explored in a number of possible ways. The partic- 
ular case of coarse-to-fine exploration has been successfully extended to discrete 
image models [3,8]. Within this framework, reduced versions of an original (spa- 
tial) model can be deduced in a consistent way (the form of the energy and 
associated parameters are deduced at once). The “stack” of models thus ob- 
tained can then be used for inference purposes, the estimate iteratively obtained 
at a given level being used as an initialization for the processing at the next 
level. 

On the other hand, model-based hierarchical approaches aim at defining a 
new global hierarchical model which has nothing to do with any original (spa- 
tial) model. It has to be manipulated as a whole, but according to procedures of 
reduced complexity. These models usually lie on the nodes of a quad-tree whose 
leaves fit the pixels of (maximum resolution) images [4,9,10]. In this case, the
peculiar dependency structure, as in the case of Markov chains, allows non-iterative
inference procedures made of two sweeps: a bottom-up sweep propagating all
information to the root, and a top-down one which in turn allows an optimal estimate
to be obtained at each node given all the data.

One of the drawbacks of these tree-based approaches lies in the structural
constraints they impose. First, they might appear artificial for certain types
of problems or data; in any case, the relevance of the inferred variables at
the coarsest levels is not obvious (especially at the root). Second, the
complete tree structure is cumbersome in the case of large images.

To circumvent this, we propose here a hierarchical model based on a “hybrid”
structure which combines a spatial grid of reduced size at a coarser level with
“sub-trees” appended below it, down to the finest level (see Fig. 1). This paper
investigates the use of the MAP estimator and of the MPM estimator, known
to be more reliable than MAP, on the hybrid structure. Note that the MPM
estimator relies on the computation of posterior marginals, which are a key
ingredient of EM-type parameter estimation techniques.

Section 2 describes the hybrid structure and its associated energy function.
In Section 3, we explain the procedures to obtain the MAP and MPM estimates
by mixing a non-iterative inference on sub-trees with an iterative
deterministic inference of reduced cost at the top of the structure. Section 4
illustrates these procedures on image classification problems with synthetic
and real images.

2 Hierarchical Model and Energy Function 




Fig. 1. Two hierarchical structures: (a) quadtree with three levels; (b) truncated 
tree with two levels. 



The hierarchical model we use is based on a hybrid structure [6]. One example
is shown in Fig. 1(b) for a single level below the coarsest grid. To describe
this graph, we introduce some notation.

First, we define the coarsest level $S^0$ as a rectangular grid with a
first-order neighborhood. Then each site of $S^0$ initiates a quadtree, so that
the grid $S^n$ ($0 \le n \le N$) made up of the nodes at level $n$ is
$2^n \times 2^n$ times larger than $S^0$. Each site $i$ of $S^n$ has four
natural correspondents in $S^{n+1}$ (provided that $i$ does not belong to the
finest level $S^N$), its children, forming the site set $i^+$, and one natural
correspondent in $S^{n-1}$ (provided that $i$ does not belong to the coarsest
level $S^0$), its parent, denoted $i^-$. Finally, the site set forming the tree
rooted at $i$ is denoted $T_i$ (Fig. 1). Vectors $x$ and $y$ are now indexed by
the nodes of $S = \bigcup_{n=0}^{N} S^n$.
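The parent and child relations described above can be sketched in a few lines of Python. This is only an illustration under conventions that are not in the original text: a site is represented by a hypothetical tuple `(n, r, c)` (level, row, column), and all function names are assumptions.

```python
def level_shape(shape0, n):
    """Grid shape of level n when the coarsest grid S^0 has shape `shape0`."""
    h0, w0 = shape0
    return (h0 * 2 ** n, w0 * 2 ** n)


def children(site, n_levels):
    """Four children of `site` at the next finer level (none at the finest level)."""
    n, r, c = site
    if n == n_levels:          # finest level S^N: no children
        return []
    return [(n + 1, 2 * r + dr, 2 * c + dc) for dr in (0, 1) for dc in (0, 1)]


def parent(site):
    """Parent of `site` at the next coarser level (None on the coarsest grid)."""
    n, r, c = site
    if n == 0:                 # coarsest level S^0: no parent
        return None
    return (n - 1, r // 2, c // 2)


def subtree(site, n_levels):
    """All sites of the tree rooted at `site`, the root included."""
    sites, frontier = [], [site]
    while frontier:
        s = frontier.pop()
        sites.append(s)
        frontier.extend(children(s, n_levels))
    return sites
```

With `n_levels = 2`, for instance, the tree rooted at a site of the coarsest grid contains 1 + 4 + 16 sites.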

Given this graphical structure, consider an energy function of the following
form:

$$U(x,y) = \sum_{\langle i,j\rangle \subset S^0} v_{ij}(x_i,x_j) + \sum_{i \notin S^0} w_i(x_i,x_{i^-}) + \sum_{i \in S} l_i(x_i,y_i) , \tag{1}$$

where $\langle i,j\rangle$ designates pairs of neighbors in $S^0$, $i^-$ is the
parent of site $i$, and $v_{ij}$ and $w_i$ are local functions capturing
respectively the spatial prior and the hierarchical prior (they will






Annabelle Chardin and Patrick Perez 



usually encourage identity between neighbors and between parents and children,
resp.), and $l_i$ expresses the point-wise relation between the observed
variable $y_i$ ¹ and the unknown one $x_i$. The MAP estimator for this model
amounts to minimizing $U(x,y)$ in (1) with respect to $x$ for a given $y$.

From a probabilistic point of view, the associated posterior distribution of
$x$ given $y$ is:

$$P(x|y) \propto \prod_{\langle i,j\rangle \subset S^0} g_{ij}(x_i,x_j) \times \prod_{i \notin S^0} f_i(x_i,x_{i^-}) \times \prod_{i \in S} h_i(x_i,y_i) , \tag{2}$$

with $g_{ij}(x_i,x_j) = \exp\{-v_{ij}(x_i,x_j)\}$,
$f_i(x_i,x_{i^-}) = \exp\{-w_i(x_i,x_{i^-})\}$ and
$h_i(x_i,y_i) = \exp\{-l_i(x_i,y_i)\}$.

The MPM estimator requires deducing from this global posterior distribution
each of the local posterior distributions $P(x_i|y)$, for $i \in S$.



3 Semi-iterative Inferences 



3.1 MAP Computation 



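The global minimization of $U(x,y)$ w.r.t. $x$ can be written in a way that
distinguishes the variables on $S^0$ (interacting in a non-causal fashion) from
the others (interacting in a tree-based causal fashion):

$$\min_x U(x,y) = \min_{x_{S^0}} \Big\{ \sum_{\langle i,j\rangle\subset S^0} v_{ij}(x_i,x_j) + \sum_{i\in S^0} l_i(x_i,y_i) + \sum_{i\in S^0} \min_{x_{T_i\setminus\{i\}}} \sum_{j\in T_i\setminus\{i\}} \big[ w_j(x_j,x_{j^-}) + l_j(x_j,y_j) \big] \Big\} , \tag{3}$$

$T_i$ denoting the tree rooted at $i$.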



Each tree-based minimization w.r.t. $x_{T_i\setminus\{i\}}$ for $i \in S^0$ in
the right-hand side can be conducted exactly, with fixed complexity per node,
using an extension of the chain-based Viterbi algorithm [7]. The first (upward)
sweep of this algorithm computes recursively the optimal value
$x_i^*(x_{i^-})$ at a node $i$ given the value of its parent, and the
associated value $V_i(x_{i^-})$ for the energy term that concerns $T_i$:

$$V_i(x_{i^-}) = \min_{x_{T_i}} \sum_{j\in T_i} \big[ w_j(x_j,x_{j^-}) + l_j(x_j,y_j) \big] \tag{4}$$

$$\phantom{V_i(x_{i^-})} = \min_{x_i} \Big[ w_i(x_i,x_{i^-}) + l_i(x_i,y_i) + \sum_{j\in i^+} V_j(x_i) \Big] \tag{5}$$

¹ In the following we consider that measurements are available at each node
$i \in S$, to take into account the case of multiresolution images. Derivations
are readily adapted to the cases where data are only available on a subset of
$S$. In practice data are, for instance, often defined at the finest level
only. In this paper our experiments were conducted on monoresolution images
which were either monospectral (synthetic cases) or multispectral (real
satellite data).

and

$$x_i^*(x_{i^-}) = \arg\min_{x_i} \Big[ w_i(x_i,x_{i^-}) + l_i(x_i,y_i) + \sum_{j\in i^+} V_j(x_i) \Big] . \tag{6}$$

Once this recursion is completed, it remains to perform in (3) the minimization
w.r.t. $x_{S^0}$, which amounts to

$$\min_{x_{S^0}} \sum_{\langle i,j\rangle\subset S^0} v_{ij}(x_i,x_j) + \sum_{i\in S^0} \Big[ l_i(x_i,y_i) + \sum_{j\in i^+} V_j(x_i) \Big] .$$

Unless $S^0$ is very small, an iterative ICM-type procedure [2] must be
introduced here due to the non-causal term
$\sum_{\langle i,j\rangle\subset S^0} v_{ij}(x_i,x_j)$. This provides estimates
$\hat{x}_i$ for $i \in S^0$, from which all the other estimates are recursively
recovered thanks to the functions $x_i^*$: $\hat{x}_i = x_i^*(\hat{x}_{i^-})$.
The whole procedure goes as follows:

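Semi-iterative energy minimization

▲ upward sweep
   Leaves ($i \in S^N$)
      $V_i(x_{i^-}) = \min_{x_i} [ w_i(x_i,x_{i^-}) + l_i(x_i,y_i) ]$
      $x_i^*(x_{i^-}) = \arg\min_{x_i} [ w_i(x_i,x_{i^-}) + l_i(x_i,y_i) ]$
   Recursion (for $n = N-1 \ldots 1$, $i \in S^n$)
      $V_i(x_{i^-}) = \min_{x_i} [ w_i(x_i,x_{i^-}) + l_i(x_i,y_i) + \sum_{j\in i^+} V_j(x_i) ]$
      $x_i^*(x_{i^-}) = \arg\min_{x_i} [ w_i(x_i,x_{i^-}) + l_i(x_i,y_i) + \sum_{j\in i^+} V_j(x_i) ]$

◄► coarse ICM: update a few times all sites of $S^0$ in turn, for energy
      $\sum_{\langle i,j\rangle\subset S^0} v_{ij}(x_i,x_j) + \sum_{i\in S^0} [ l_i(x_i,y_i) + \sum_{j\in i^+} V_j(x_i) ]$
      $\to \hat{x}_i$, $\forall i \in S^0$

▼ downward sweep
      $\hat{x}_i = x_i^*(\hat{x}_{i^-})$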

If an exact minimization can be obtained at the coarsest level (which is
notably the case for the complete tree, where $S^0$ reduces to a single site),
the final estimate $\hat{x}$ is exactly the global minimizer of $U$. Note that
the functions $V_i(x_{i^-})$, which appear in the upward sweep, progressively
collect dependencies with respect to the data, even though this is not made
explicit, by abuse of notation: $V_i(x_{i^-})$ (as well as $x_i^*(x_{i^-})$)
actually depends on $y_{T_i}$, the data attached to the tree rooted at $i$.
This means that the ICM at level 0 and the downward sweep provide inferences
based on all the data.
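The two sweeps can be sketched as follows for the complete-tree case, where the coarsest level is a single root and the minimization there is exact. This is a minimal illustration rather than the authors' implementation: the tree is given as a hypothetical dictionary of children lists, `unary[i][k]` stands for $l_i(x_i = k, y_i)$, and the parent-child term is assumed to be of Potts type, $w_i = \beta[1 - \delta]$. On a hybrid structure, the root step would be replaced by the coarse ICM on the grid.

```python
import numpy as np


def map_on_tree(children, root, unary, beta):
    """Exact min-sum MAP on a rooted tree with a Potts parent-child term.

    children : dict mapping a node to the list of its children
    unary    : dict mapping a node to a length-K array; unary[i][k] plays the
               role of l_i(x_i = k, y_i)
    beta     : parent-child penalty, w_i = beta * [x_i != x_parent]
    Returns (labels, minimal energy).
    """
    K = len(unary[root])
    w = beta * (1.0 - np.eye(K))              # w[x_i, x_parent]

    # Pre-order traversal: every node appears after its parent.
    order, parent, stack = [], {root: None}, [root]
    while stack:
        i = stack.pop()
        order.append(i)
        for j in children.get(i, []):
            parent[j] = i
            stack.append(j)

    V, best, cost = {}, {}, {}
    for i in reversed(order):                 # upward sweep (children first)
        c = np.asarray(unary[i], dtype=float).copy()
        for j in children.get(i, []):
            c += V[j]                         # V[j] is indexed by the label of i
        cost[i] = c
        if i != root:
            total = c[:, None] + w            # total[x_i, x_parent]
            V[i] = total.min(axis=0)          # V_i(x_parent)
            best[i] = total.argmin(axis=0)    # optimal x_i for each parent label

    labels = {root: int(cost[root].argmin())}  # exact minimization at the root
    for i in order[1:]:                       # downward sweep (parents first)
        labels[i] = int(best[i][labels[parent[i]]])
    return labels, float(cost[root].min())
```

The downward sweep only reads the look-up tables `best` built during the upward sweep, which is why it is cheap.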

When $y$ is only attached to $S^N$, this procedure can be compared to the
multigrid method in [8], where the non-hierarchical energy

$$U^N(x^N,y) = \sum_{\langle i,j\rangle\subset S^N} v_{ij}(x_i,x_j) + \sum_{i\in S^N} l_i(x_i,y_i) \tag{7}$$

is minimized within the set of configurations which are piecewise constant over
$2^{N-n} \times 2^{N-n}$ blocks, for $n = 0 \ldots N$; equivalently

$$U^n(x^n,y) = \sum_{\langle i,j\rangle\subset S^n} v^n_{ij}(x^n_i,x^n_j) + \sum_{i\in S^n} \sum_{j\in T_i\cap S^N} l_j(x^n_i,y_j) \tag{8}$$

is minimized, where $x^n$ is defined on $S^n$ and $v^n_{ij}$ denotes the
interaction induced between neighboring blocks. Note that $U^0$, thus defined
for $n = 0$ in the multigrid approach, corresponds to the energy which is
manipulated at the coarsest level of the semi-iterative approach for
$w_i(x_i,x_{i^-}) = \beta[1 - \delta(x_i,x_{i^-})]$ with $\beta \to +\infty$
(in this case the optimal configuration $\hat{x}$ is constant over each tree
$T_i$, $i \in S^0$, and
$\sum_{j\in i^+} V_j(x_i) = \sum_{j\in T_i\cap S^N} l_j(x_i,y_j)$). Moreover,
if data are only available at the finest level, the initialization of the
multigrid coarse ICM is given by the minimization of
$\sum_{j\in T_i\cap S^N} l_j(x_i,y_j)$ for each site $i \in S^0$, while the
initialization of the hybrid coarse ICM is given by the minimization of
$\sum_{j\in i^+} V_j(x_i)$ for each site $i \in S^0$. To confirm this
statement, we just have to compare the coarsest initializations and estimates
provided by the two methods, taking the same number of levels, the same
functions $v_{ij}$ and $l_i$, and
$w_i(x_i,x_{i^-}) = \beta[1 - \delta(x_i,x_{i^-})]$ with $\beta$ very large for
the hybrid energy.

3.2 MPM Computation 

In the case of a complete tree where $S^0$ reduces to a single site, the exact
MPM estimates can be computed on each node [9] using an extension of the
Baum-Welch algorithm on a chain [1]. In the general case, the downward
recursion is based on the following relation:

$$\forall i \notin S^0, \quad P(x_i|y) = \sum_{x_{i^-}} P(x_i|x_{i^-},y)\, P(x_{i^-}|y) ,$$

with $P(x_i|x_{i^-},y) = P(x_i|x_{i^-},y_{T_i})$ due to the separation
property. The use of this recursion requires that a previous upward sweep
provides $P(x_i|x_{i^-},y_{T_i})$ for $i \notin S^0$ and $P(x_i|y)$ for
$i \in S^0$. This is achieved by successively summing out the $x_i$'s for all
$i \notin S^0$. The recursion is based on:

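$$P(x_i|x_{i^-},y_{T_i}) \propto f_i(x_i,x_{i^-})\, h_i(x_i,y_i) \sum_{x_{T_i\setminus\{i\}}} \prod_{j\in T_i\setminus\{i\}} f_j(x_j,x_{j^-})\, h_j(x_j,y_j)$$

$$\propto f_i(x_i,x_{i^-})\, h_i(x_i,y_i) \prod_{j\in i^+} \sum_{x_{T_j}} \prod_{k\in T_j} f_k(x_k,x_{k^-})\, h_k(x_k,y_k) , \tag{9}$$

with

$$F_j(x_{j^-}) \triangleq \sum_{x_{T_j}} \prod_{k\in T_j} f_k(x_k,x_{k^-})\, h_k(x_k,y_k)$$

$$= \sum_{x_j} f_j(x_j,x_{j^-})\, h_j(x_j,y_j) \prod_{k\in j^+} F_k(x_j) . \tag{10}$$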

This eventually provides the probability

$$P(x_{S^0}|y) = \sum_{x_{S\setminus S^0}} P(x|y) .$$

Because of the non-causal structure on $S^0$, the marginals $P(x_i|y)$ for
$i \in S^0$ have to be approximated with the help of Gibbs sampling of the
distribution $P(x_{S^0}|y)$. This procedure provides approximate local
posterior marginals in a semi-iterative way. If the MPM estimator is employed,
an approximation of it is obtained as a by-product.

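Then the whole procedure goes as follows:

Semi-iterative local posterior marginal computation and MPM estimation

▲ upward sweep
   Leaves ($i \in S^N$)
      $F_i(x_{i^-}) = \sum_{x_i} f_i(x_i,x_{i^-})\, h_i(x_i,y_i)$
   Recursion (for $n = N-1 \ldots 1$, $i \in S^n$)
      $F_i(x_{i^-}) = \sum_{x_i} f_i(x_i,x_{i^-})\, h_i(x_i,y_i) \prod_{j\in i^+} F_j(x_i)$

◄► coarse posterior marginal computation:
      draw samples $x_{S^0}(1), \ldots, x_{S^0}(m)$ from
      $P(x_{S^0}|y) \propto \prod_{\langle i,j\rangle\subset S^0} g_{ij}(x_i,x_j) \times \prod_{i\in S^0} h_i(x_i,y_i) \prod_{j\in i^+} F_j(x_i)$
      approximation of $P(x_i|y) \approx \frac{1}{m-k} \sum_{j=k+1}^{m} \delta[x_i(j), x_i]$

▼ downward sweep
   Recursion (for $n = 1 \ldots N$, $i \in S^n$)
      $P(x_i|y) = \sum_{x_{i^-}} P(x_i|x_{i^-},y_{T_i})\, P(x_{i^-}|y)$
   MPM at leaves
      $\hat{x}_i = \arg\max_{x_i} P(x_i|y)$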






Here, the same abuse of notation is made as in the MAP procedure: the functions
$F_i(x_{i^-})$ in fact depend on $y_{T_i}$, so that the progressive summations
over the $x_i$'s for all $i \notin S^0$ are made with respect to the data
associated with $T_i$, and the downward sweep provides local posterior
marginals based on all the data.
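For the complete-tree case (a single root, so no Gibbs sampling is needed), the upward and downward sweeps can be sketched as follows. This is an illustrative sketch under assumptions that are not in the original text: the tree is a hypothetical dictionary of children lists, `unary[i][k]` stands for $l_i(x_i = k, y_i)$, the parent-child factor is of Potts type, and no rescaling is applied, so it is only suited to small trees (products of many factors would underflow on large ones).

```python
import numpy as np


def marginals_on_tree(children, root, unary, beta):
    """Exact posterior marginals P(x_i | y) on a rooted tree (sum-product).

    unary[i][k] plays the role of l_i(x_i = k, y_i), so h_i = exp(-unary[i]);
    the parent-child factor is f(x_i, x_parent) = exp(-beta * [x_i != x_parent]).
    """
    K = len(unary[root])
    f = np.exp(-beta * (1.0 - np.eye(K)))     # f[x_i, x_parent]

    order, parent, stack = [], {root: None}, [root]
    while stack:                              # pre-order: parents before children
        i = stack.pop()
        order.append(i)
        for j in children.get(i, []):
            parent[j] = i
            stack.append(j)

    F, local = {}, {}
    for i in reversed(order):                 # upward sweep
        m = np.exp(-np.asarray(unary[i], dtype=float))
        for j in children.get(i, []):
            m = m * F[j]                      # h_i(x_i) * prod_j F_j(x_i)
        local[i] = m
        if i != root:
            F[i] = m @ f                      # F_i(x_parent) = sum over x_i

    post = {root: local[root] / local[root].sum()}
    for i in order[1:]:                       # downward sweep
        cond = f * local[i][:, None] / F[i][None, :]   # P(x_i | x_parent, y_Ti)
        post[i] = cond @ post[parent[i]]
    return post
```

An MPM estimate then follows by taking, at each node, the label with the largest marginal, e.g. `{i: int(p.argmax()) for i, p in post.items()}`.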



4 Supervised Classification Comparisons 

To demonstrate the practicability and the relevance of the approach for
discrete low-level image analysis, we first report comparative experiments for
supervised classification. To this end, we considered a Potts-type prior with
potentials

$$v_{ij}(x_i,x_j) = 2^N \alpha\, [1 - \delta(x_i,x_j)] , \tag{11}$$

$$w_i(x_i,x_{i^-}) = \beta\, [1 - \delta(x_i,x_{i^-})] , \tag{12}$$

along with Gaussian likelihoods

$$l_i(x_i = k, y_i) = \begin{cases} \dfrac{(y_i-\mu_k)^2}{2\sigma_k^2} + \log\sigma_k & \text{if } i \in S^N , \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$

Comparative experiments were conducted for $N \in \{0, 2, 3, 8\}$ for the
synthetic scene and for $N \in \{0, 3, 4, 8\}$ for the real SPOT scene. For
$N = 0$ the inferences are based on standard non-hierarchical models, using
iterative ICM for the approximation of the MAP or an iterative Gibbs sampler
for the approximation of the MPM. $N = 8$, when the size of $S^N$ is
$2^8 \times 2^8$, corresponds to the complete tree ($|S^0| = 1$), allowing
exact non-iterative computations. $N = 2, 3, 4$ correspond to three-, four-
and five-level hybrid structures on which semi-iterative MAP and MPM
inferences are performed.



4.1 Synthetic Images 

First, the experiments were carried out on a 256 × 256 synthetic image
involving 5 classes (Fig. 2). We added white Gaussian noise with a different
standard deviation for each class, so that the gray level means and variances
are known.
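The noise model and the Gaussian data term of the supervised setting can be illustrated with the following sketch. The two-class stripe image, the means and the standard deviations are made-up values, and the function names are hypothetical; the classification shown uses the data term alone, without any prior.

```python
import numpy as np


def add_class_noise(labels, means, sigmas, rng):
    """Gray level image: class mean plus white Gaussian noise whose standard
    deviation depends on the class of each pixel."""
    means, sigmas = np.asarray(means, float), np.asarray(sigmas, float)
    noise = rng.standard_normal(labels.shape) * sigmas[labels]
    return means[labels] + noise


def gaussian_data_term(image, means, sigmas):
    """data[k, r, c] = (y - mu_k)^2 / (2 sigma_k^2) + log(sigma_k)."""
    means, sigmas = np.asarray(means, float), np.asarray(sigmas, float)
    diff = image[None, :, :] - means[:, None, None]
    return (diff ** 2 / (2.0 * sigmas[:, None, None] ** 2)
            + np.log(sigmas)[:, None, None])


rng = np.random.default_rng(0)
labels = np.zeros((64, 64), dtype=int)
labels[:, 32:] = 1                       # two vertical stripes, classes 0 and 1
means, sigmas = [50.0, 150.0], [10.0, 20.0]
image = add_class_noise(labels, means, sigmas, rng)
# Pixel-wise classification from the data term only (no spatial prior).
ml_labels = gaussian_data_term(image, means, sigmas).argmin(axis=0)
error_rate = float((ml_labels != labels).mean())
```

With well-separated classes such as these, the data term alone already gives a low error rate; the hierarchical priors matter when the noise is strong enough for the classes to overlap.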




Fig. 2. Synthetic data. Panels: synthetic image, moderate noise, larger noise.



First, we focus on the MAP procedure to support experimentally the statement
made in Section 3, according to which the semi-iterative MAP inference with
data at the finest level only tends to the constrained non-hierarchical
minimization when $\beta \to +\infty$. It is easy to see (Fig. 3) that both the
initializations and the results of the coarse ICM for the multigrid method and
the semi-iterative one are the same as soon as $\beta$ is large enough (greater
than 80). Accordingly, the multigrid approach and the semi-iterative procedure
with $\beta \to +\infty$ really behave in the same way.

Secondly, we worked on a moderately corrupted image and a more heavily
corrupted one, in order to see how our algorithms behave for a noisier image
(see Fig. 2). The classification results are shown in Fig. 4 and Fig. 5, with
their respective percentages of misclassification and CPU times in seconds
(on a 170 MHz Ultra Sparc I workstation) in Tables 1 and 2.

As can be seen from Figs. 4 and 5, the hierarchical models provide good
results. When the noise level is low, the resulting classifications of all the
methods are of almost equivalent quality. As the degradation of the synthetic
image increases, the hierarchical models and the multigrid approach seem more
robust to noise than the plain non-hierarchical iterative procedures. Moreover,
the non- or semi-iterative algorithms take much less CPU time: for the MAP,
they take three times less CPU time than the ICM, and for the MPM, more than
forty times less CPU time than the Gibbs sampler. For the estimation of the
posterior marginals in the sampling procedure, we fit the number of retained
samples ($m - k$, see the description of the algorithm) to the size of the
concerned grid (200 for the complete grid and 20 for the coarse grid of the
hybrid structure).

Fig. 3. Comparative results at the coarsest level between the multigrid method
(a1-b1) and the semi-iterative method (a2-b2) with $N = 2$ and
$\beta \to +\infty$: (a1), (a2) coarsest-level initialization; (b1), (b2)
coarsest-level output.

To go further in the comparisons, we can focus on the results for the complete
tree ($N = 8$) and for the two examples of the hybrid structure for each
estimator. First, the tables show that the semi-iterative estimation provides
slightly better classification than the non-iterative one for a comparable CPU
time. Moreover, the MPM classifications are slightly better than the MAP ones,
especially for $N = 2$, but they require more calculations because the MAP
downward recursion is simpler (one only has to read look-up tables built in the
upward recursion, whereas some calculations are needed during the MPM downward
step). In any case, whatever estimator is used, the use of a coarse iterative
procedure on top of the hybrid structures does not seem to imply an extra
computational load, while improving the results.

As far as the visual aspect is concerned, the semi-iterative classifications still 
reveal a blocky aspect (as observed in all tree-based approaches). But these 
artifacts are less and less pronounced as the number of levels in the hybrid 
structure decreases. 

Besides the fact that the MPM does better than the MAP, the hierarchical MPM
provides a measure of relative confidence associated with the estimated value
at each site, through the entropy

$$C_i = -\sum_{x_i} P(x_i|y) \log P(x_i|y) .$$
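Given an array of posterior marginals, this entropy can be computed site by site as in the sketch below. The array layout `(K, H, W)` and the small constant `eps` are assumptions made for the illustration; a low entropy corresponds to a confident estimate, and the maximum value $\log K$ is reached for a uniform marginal.

```python
import numpy as np


def entropy_map(marginals, eps=1e-12):
    """Per-site entropy C_i = -sum_k P(x_i = k | y) log P(x_i = k | y).

    marginals: array of shape (K, H, W) whose K values sum to 1 at every site.
    eps is a hypothetical guard so that 0 * log(0) is evaluated as 0.
    """
    p = np.asarray(marginals, dtype=float)
    return -(p * np.log(np.clip(p, eps, 1.0))).sum(axis=0)
```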

These measures (Fig. 6) allow us to better appreciate the quality of the
obtained estimates and to use them accordingly.

Fig. 4. Comparative results for the classification problem with a moderately
corrupted image. Panels: MAP multigrid, MAP N = 0, MPM N = 0; MAP N = 2,
MPM N = 2; MAP N = 3, MPM N = 3; MAP N = 8, MPM N = 8.

Fig. 5. Comparative results for the classification problem with a more
corrupted image. Panels: MAP N = 3, MPM N = 3; MAP N = 8, MPM N = 8.

Model                  MAP                           MPM
                       misclassification  cpu time   misclassification  cpu time
iterative N = 0        5.3%               11s        7%                 7min
semi-iterative N = 2   4.15%              3.6s       3.6%               10.5s
semi-iterative N = 3   4.2%               3.6s       4.5%               9.3s
non-iterative N = 8    4.8%               3.6s       4.7%               9s
multigrid              5.9%               11.6s

Table 1. Performances of the different algorithms with the synthetic image in
Fig. 4.



Model                  MAP                           MPM
                       misclassification  cpu time   misclassification  cpu time
iterative N = 0        9.65%              11s        10%                7min
semi-iterative N = 2   6.6%               3.6s       6.2%               10.5s
semi-iterative N = 3   7.9%               3.6s       7.9%               10s
non-iterative N = 8    8%                 3.6s       8%                 9s
multigrid              7.3%               11.6s

Table 2. Performances of the different algorithms with the synthetic image in
Fig. 5.



Fig. 6. Confidence maps associated to classification (a) of the moderately
noisy image and (b) of the noisier image, for N = 2, N = 3 and N = 8.

4.2 Real Satellite Images

Fig. 7. 512 × 512 SPOT images (courtesy of Costel, University of Rennes 2 and
GSTB).



The previous algorithms were applied to SPOT satellite images provided by
the Costel laboratory (University of Rennes 2) in the context of remote sensing
research. The scene (Fig. 7) is composed of three 512 × 512 images with
different wavelengths in the visible spectrum and represents the Bay of Lannion
in France in December 1996. The goal of this study was to determine the land
cover of this area. To this end, the geographers of the Costel laboratory built
a list of eight classification categories: Sea and water, Sand and bare soil,
Urban area, Forests and heath, Temporary meadows, Permanent meadows, Colza,
Vegetables.

Thanks to both ground surveys and photointerpretation, they were also able to
provide samples of these eight categories on the three SPOT images of the
scene. Some of them were used to learn the gray level means and variances of
each category for each image, so that we could perform supervised
classifications, and the remaining samples were kept to assess the accuracy of
the classifications.

As for the model, we considered the three channels as independent. As a
consequence, the relation between the observed variables (here the
multispectral scene) and the unknown variables (class label at each pixel)
becomes:

$$l_i(x_i = k, y_i^c, c \in \{1,2,3\}) = \sum_{c=1}^{3} \left[ \frac{(y_i^c-\mu_{kc})^2}{2\sigma_{kc}^2} + \log(\sigma_{kc}) \right] \quad \text{if } i \in S^N , \tag{14}$$

where $(\mu_{kc}, \sigma_{kc}^2)$ are the gray level mean and variance of the
class $k$ within channel number $c \in \{1,2,3\}$.
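Under the channel-independence assumption, the data term is simply a sum of one-channel Gaussian terms. The sketch below illustrates this; the array layouts (`(C, H, W)` for the image, `(K, C)` for the class parameters) and the function name are assumptions made for the example.

```python
import numpy as np


def multispectral_data_term(image, means, sigmas):
    """Sum over independent channels of the Gaussian data term.

    image  : array (C, H, W), one plane per spectral channel
    means  : array (K, C), mean of class k in channel c
    sigmas : array (K, C), standard deviation of class k in channel c
    Returns an array (K, H, W).
    """
    image = np.asarray(image, float)
    means, sigmas = np.asarray(means, float), np.asarray(sigmas, float)
    diff = image[None] - means[:, :, None, None]            # (K, C, H, W)
    per_channel = (diff ** 2 / (2.0 * sigmas[:, :, None, None] ** 2)
                   + np.log(sigmas)[:, :, None, None])
    return per_channel.sum(axis=1)                          # sum over channels
```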

The eight algorithms provide quite similar results of good quality (see Figs. 8
and 9, and Table 3). About 92% of the pixels of the samples are well
classified. The three hierarchical MAP inferences are achieved in roughly half
the CPU time of the iterative ICM. In comparison with the iterative MPM, which
takes more than one hour to be performed, the three hierarchical MPM inferences
need reasonable CPU time. This is encouraging for EM-type parameter estimation
algorithms, for which the form of one recursion step is close to that of the
MPM algorithm.




Fig. 8. Comparative results for the classification problem with multispectral
SPOT images and the MAP estimator (see Fig. 10 for the legend). Panels: N = 4,
N = 8.



5 Conclusion and Extensions 

In this paper, we presented a hybrid hierarchical structure which is an
interesting compromise between standard spatial models and hierarchical models
based on a complete quadtree. We introduced two inference algorithms devoted to
this new structure: the first one computes the MAP estimate and the second one
computes the local posterior marginals and the MPM estimate.

Fig. 9. Comparative results for the classification problem with multispectral
SPOT images and the MPM estimator (see Fig. 10 for the legend). Panels: N = 4,
N = 8.

Fig. 10. Legend for the classifications shown in Fig. 8 and 9: Vegetables, Sea
and water, Permanent meadows, Temporary meadows, Forests and heath, Colza,
Urban area, Sand and bare soil.

Model                  MAP                           MPM
                       misclassification  cpu time   misclassification  cpu time
iterative N = 0        8%                 150s       8%                 1h10min
semi-iterative N = 3   7.5%               85s        7.5%               165s
semi-iterative N = 4   7.5%               85s        7.5%               165s
non-iterative N = 8    8%                 65s        7.8%               160s

Table 3. Performances of the different algorithms with the real multispectral
images in Fig. 7.

With this structure we now plan to deal with the critical problem of parameter
estimation: using the marginal computation introduced here, we should be able
to design specific EM-like algorithms on the hybrid structure, as already done
with trees [4,9] and with spatial grids [5]. In the classification problem,
this will concern both the data parameters (number of classes, gray level means
and variances of each class), whose automatic estimation will allow
unsupervised classification, and the parameters of the prior model
($\alpha$, $\beta$). We also plan to address the issue of automatically
estimating the optimal number of levels in the structure.

Abstract. In this paper we investigate several dynamics to optimize a
posterior distribution defined to solve segmentation problems. We first 
consider the Metropolis and the Kawasaki dynamics. We also compare 
the associated Bayesian cost functions. The Kawasaki dynamic appears 
to provide better results but requires the exact values of the class ratios. 
Therefore, we define alternative dynamics which conserve the properties 
of the Kawasaki dynamic and require only an estimation of the class 
ratios. We show on synthetic data that these new dynamics can improve 
the segmentation results by incorporating some information on the class 
ratios. Results are compared using a Potts model as prior distribution. 



1 Introduction 

For the past decade, Gibbs fields have been widely used in image processing,
especially for image segmentation [1]. The segmented image is given by a ground
state of the model composed of a prior model (the Potts model for instance) and
a likelihood term depending on the data. This latter term can be interpreted
as an inhomogeneous external field. The ground state is obtained by sampling
the model at zero temperature using a decreasing temperature parameter in a
simulated annealing scheme [2]. The Metropolis dynamic or the Glauber dynamic
is usually used in the simulated annealing algorithm. These dynamics consist
in sampling the model in the canonical ensemble, i.e. in the whole
configuration space.

In case of severe noise, the external field (given by the data) is less
reliable. Therefore, the prior model (the interactions) has a greater influence
on the results. Artifacts due to this prior can then appear in the segmented
image. Without external field the Potts model presents a phase transition at
low temperature [3]. In a typical sample, one label prevails and at low
temperature, a uniform configuration is obtained. With a uniform external field
this phenomenon does not exist. However, for inhomogeneous external fields,
phase transitions may appear, as shown for the Ising model with a chess-board
external field in [4]. In an image segmentation framework, this leads to
over-regularization if the signal to noise ratio is too low. Fine objects are
erased during the segmentation scheme and big objects tend to spread into their
environment in case of severe noise.

* This work was partially supported by the French-Russian Institute A.M.
Liapunov (Grant 98-02)

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 99-114, 1999.

The main idea of this paper is to avoid this effect by sampling the model
in a micro-canonical ensemble, i.e. a subspace of the configuration space which
excludes undesired configurations. In this paper, we first consider that the
proportions between the different labels are fixed. The Kawasaki dynamic allows
us to derive such a sampling [5]. With this dynamic we can expect to preserve
objects because a single label cannot fill the whole volume. However, in
practice the ratio between the different classes is unknown. Therefore, we
first have to estimate the class ratios. To take into account the bias of the
estimates, we have to modify the Kawasaki dynamic.

Herein we study different dynamics for image segmentation. We first compare 
Metropolis and Kawasaki dynamics. Embedded in a Bayesian framework, these 
dynamics correspond to two different cost functions. To tackle the problem of the 
error in the class ratios estimation, we define a new Bayesian risk function leading 
to optimize the model in a “soft” micro-canonical ensemble. This cost function 
allows configurations having class ratios slightly different from the estimated 
one. We then study a mixture of Metropolis and Kawasaki dynamics. Finally, 
we compare the results obtained with the different dynamics on a synthetic image 
corrupted by a severe Gaussian noise. 

2 Metropolis and Kawasaki Dynamics 

2.1 Background and Notations 

Herein we consider the problem of image segmentation using a Potts model.
Denote by $X = \{x_s\}$ the data on the lattice $S = \{s\}$ and by
$Y = \{y_s\}$ the segmented image. The segmentation problem consists in
optimizing the a posteriori probability $P(Y|X)$ using the Bayes rule, which
gives:

$$P(Y|X) \propto P(X|Y)\, P(Y). \tag{1}$$

$P(X|Y)$ refers to the likelihood and $P(Y)$ to the prior.

We consider the pixels to be conditionally independent, i.e.:

$$P(X|Y) = \prod_{s\in S} p(x_s|y_s). \tag{2}$$

The different classes are supposed to be Gaussian, i.e.:

$$p(x_s|y_s) = \sum_{l\in\Lambda} \frac{1}{\sqrt{2\pi\sigma_l^2}} \exp\left[-\frac{(x_s-\mu_l)^2}{2\sigma_l^2}\right] \Delta(y_s,l), \tag{3}$$

where $l$ indexes the different classes, and $\mu_l$ and $\sigma_l^2$ are the
mean and variance of the corresponding Gaussian distribution. $\Delta(a,b)$ is
equal to 1 if $a = b$ and to 0 otherwise.

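We consider a prior defined by an 8-connected Potts model [6,7,8], written as
follows:

$$P(Y) = \frac{1}{Z} \exp\Big[-\sum_{\langle s,s'\rangle} \beta\,(1-\Delta(y_s,y_{s'}))\Big], \tag{4}$$

where $\langle s,s'\rangle$ defines a clique, i.e. is an element of the set:

$$C = \{\langle s,s'\rangle \in S\times S : d(s,s') \in \{1,\sqrt{2}\}\}. \tag{5}$$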
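To make the clique set concrete, the sketch below enumerates the pairs of sites at distance 1 or $\sqrt{2}$ on a small lattice and evaluates the corresponding Potts energy. It assumes free boundary conditions (no wrap-around), which the text does not specify, and the function names are hypothetical.

```python
def potts_cliques(height, width):
    """Pairs of sites (s, s') at distance 1 or sqrt(2) on a height x width
    lattice, each unordered pair listed once (free boundaries assumed)."""
    cliques = []
    for r in range(height):
        for c in range(width):
            # right, down, down-right, down-left: covers the 8 directions once
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < height and 0 <= c2 < width:
                    cliques.append(((r, c), (r2, c2)))
    return cliques


def potts_energy(labels, beta):
    """beta times the number of cliques whose two sites carry different labels."""
    h, w = len(labels), len(labels[0])
    return beta * sum(labels[a][b] != labels[c][d]
                      for (a, b), (c, d) in potts_cliques(h, w))
```

On a 3 × 3 lattice this gives 6 horizontal, 6 vertical and 8 diagonal cliques.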

Therefore the posterior distribution can be written as a Gibbs distribution:

$$P(Y|X) \propto \exp -[U(X|Y) + U(Y)], \tag{6}$$

and the local conditional probabilities are written as follows:

$$p(y_s|y_t, t \neq s, x_s) = p(y_s|y_t, t \in \mathcal{N}_s, x_s) \propto \exp -U(y_s|\mathcal{N}_s), \tag{7}$$

where $\mathcal{N}_s$ is the neighborhood of $s$ (pixels interacting with $s$)
and the local conditional energy $U(y_s|\mathcal{N}_s)$ is written:

$$U(y_s|\mathcal{N}_s) = \sum_{t\in\mathcal{N}_s} \beta\,(1-\Delta(y_s,y_t)) + \sum_{l\in\Lambda} \left[\frac{(x_s-\mu_l)^2}{2\sigma_l^2} + \frac{1}{2}\log(2\pi\sigma_l^2)\right] \Delta(y_s,l). \tag{8}$$



To optimize the a posteriori distribution $P(Y|X)$ we consider the Bayesian
framework. The optimum is the configuration which minimizes the Bayes risk:

$$\hat{Y} = \arg\min_{Y^0} \int_{\Omega=\Lambda^S} R(Y,Y^0)\, P(Y|X)\, dY, \tag{9}$$

where $R(Y,Y^0)$ is the cost function.



2.2 Algorithms and Associated Bayesian Cost Functions 

The most widely used Bayesian criterion for image segmentation is the Maximum
A Posteriori (MAP) criterion, which maximizes the a posteriori probability:

$$\hat{Y}_{MAP} = \arg\max_{Y} P(Y|X), \tag{10}$$

which corresponds to the following cost function:

$$R_1(Y,Y^0) = 1 - \Delta(Y,Y^0), \tag{11}$$

i.e. $R_1(Y,Y^0)$ is equal to 0 if $Y = Y^0$ and to 1 otherwise.

To obtain the MAP criterion, we run a simulated annealing scheme using a
Metropolis dynamic [2]:

Algorithm 1

1 Initialize a random configuration $Y = (y_s)$, set $T = T_0$
2 For each site $s$:
   2.a Choose a random value new different from the current value cur
   2.b Compute the local energies $U(y_s = cur|\mathcal{N}_s)$ and
       $U(y_s = new|\mathcal{N}_s)$
   2.c If $U(y_s = new|\mathcal{N}_s) < U(y_s = cur|\mathcal{N}_s)$ set $y_s$
       to new, otherwise set $y_s$ to new with probability
       $\exp[-\Delta U / T]$ where
       $\Delta U = U(y_s = new|\mathcal{N}_s) - U(y_s = cur|\mathcal{N}_s)$.
3 If the stopping criterion is not reached decrease $T$ and go to 2



To adhere to the theoretical properties of convergence we should consider a 
logarithmic decrease of the temperature. However, to obtain a faster algorithm 
we have considered a linear decrease of rate 0.95. For the kind of energy functions 
we use, this decrease rate is slow enough to achieve a good solution. 
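Algorithm 1 with this cooling schedule can be sketched as follows. The sketch makes several assumptions that are not in the original text: the "decrease of rate 0.95" is read as multiplying $T$ by 0.95 after each sweep, the stopping criterion is a fixed number of sweeps, boundaries are free, and the data term is passed as a hypothetical array `data[k, r, c]`.

```python
import numpy as np

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def local_energy(labels, data, r, c, k, beta):
    """U(y_s = k | N_s): Potts term over the 8 neighbors plus the data term."""
    h, w = labels.shape
    e = data[k, r, c]
    for dr, dc in NEIGHBORS:
        r2, c2 = r + dr, c + dc
        if 0 <= r2 < h and 0 <= c2 < w:
            e += beta * (labels[r2, c2] != k)
    return e


def metropolis_annealing(data, beta, t0=1.0, rate=0.95, n_sweeps=60, seed=0):
    """Simulated annealing with single-site label flips (at least 2 classes)."""
    rng = np.random.default_rng(seed)
    n_classes, h, w = data.shape
    labels = rng.integers(n_classes, size=(h, w))
    t = t0
    for _ in range(n_sweeps):
        for r in range(h):
            for c in range(w):
                cur = labels[r, c]
                # Draw a label different from the current one.
                new = (cur + rng.integers(1, n_classes)) % n_classes
                du = (local_energy(labels, data, r, c, new, beta)
                      - local_energy(labels, data, r, c, cur, beta))
                if du < 0 or rng.random() < np.exp(-du / t):
                    labels[r, c] = new
        t *= rate          # assumed reading of the "decrease of rate 0.95"
    return labels
```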

The aim of the prior model is to regularize the solution, i.e. to avoid noise
in the segmented image. In case of severe noise or when two classes are
strongly mixed, the prior dominates the likelihood, which leads to unreliable
results. This may introduce artifacts due to the prior in the segmented image,
among them the loss of small objects [9,10]. To overcome this problem we
propose to add information which consists in keeping the proportionality
ratios of the different classes. In that way, the volume (number of pixels) of
each class is constant during the optimization. In this section, we suppose
that these ratios are known.

We consider the criterion defined by the following cost function:

$$R_2(Y,Y^0) = 1 - \Delta(Y,Y^0)\,\Delta(\gamma^0,\gamma_0), \tag{12}$$

where $\gamma$ is a vector containing the proportionality coefficients of the
different classes in the configuration $Y$ ($\gamma(i)$ = number of pixels such
that $y_s = i$, divided by the total number of pixels), $\gamma^0$ is the
corresponding vector for $Y^0$, and $\gamma_0$ represents the class ratios of
the expected solution, which are supposed to be known. Notice that the defined
function is symmetric as $(Y = Y^0) \Rightarrow (\gamma = \gamma^0)$.

To obtain the associated Bayesian criterion we run a simulated annealing 
algorithm using the Kawasaki dynamic. The Metropolis dynamic is based on a 
spin-flip procedure which consists in changing the spin value of a site (the label 
of a pixel). The Kawasaki dynamic is based on a spin-exchange procedure which 
consists in exchanging the spin values of two sites. The induced algorithm is 
written as follows: 



Algorithm 2

1 Initialize a random configuration $Y = (y_s)$ with $\gamma = \gamma_0$, set
  $T = T_0$
2 During $N$ iterations:
   2.a Choose randomly two sites $s$ and $s'$, denote $cur_s$ (resp.
       $cur_{s'}$) the current value of $y_s$ (resp. $y_{s'}$)
   2.b Compute the local energies $U(y_s = cur_s|\mathcal{N}_s)$,
       $U(y_{s'} = cur_{s'}|\mathcal{N}_{s'})$,
       $U(y_s = cur_{s'}|\mathcal{N}_s)$ and
       $U(y_{s'} = cur_s|\mathcal{N}_{s'})$
   2.c If $U(y_s = cur_{s'}|\mathcal{N}_s) + U(y_{s'} = cur_s|\mathcal{N}_{s'}) < U(y_s = cur_s|\mathcal{N}_s) + U(y_{s'} = cur_{s'}|\mathcal{N}_{s'})$
       set $y_s$ to $cur_{s'}$ and $y_{s'}$ to $cur_s$, otherwise set $y_s$ to
       $cur_{s'}$ and $y_{s'}$ to $cur_s$ with probability
       $\exp[-\Delta U / T]$ where
       $\Delta U = U(y_s = cur_{s'}|\mathcal{N}_s) + U(y_{s'} = cur_s|\mathcal{N}_{s'}) - U(y_s = cur_s|\mathcal{N}_s) - U(y_{s'} = cur_{s'}|\mathcal{N}_{s'})$.
3 If the stopping criterion is not reached decrease $T$ and go to 2
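The spin-exchange step of Algorithm 2 can be sketched as below. As in the algorithm, the four local energies are evaluated independently of each other. The data layout `data[k, r, c]`, the multiplicative cooling, the fixed number of stages and the free boundaries are assumptions made for the illustration. The key property is that exchanging two labels never changes the number of pixels in each class.

```python
import numpy as np

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def local_energy(labels, data, r, c, k, beta):
    """U(y_s = k | N_s): Potts term over the 8 neighbors plus the data term."""
    h, w = labels.shape
    e = data[k, r, c]
    for dr, dc in NEIGHBORS:
        r2, c2 = r + dr, c + dc
        if 0 <= r2 < h and 0 <= c2 < w:
            e += beta * (labels[r2, c2] != k)
    return e


def kawasaki_annealing(data, init_labels, beta, t0=1.0, rate=0.95,
                       n_stages=60, swaps_per_stage=2000, seed=0):
    """Simulated annealing with label exchanges between two random sites."""
    rng = np.random.default_rng(seed)
    labels = init_labels.copy()
    h, w = labels.shape
    t = t0
    for _ in range(n_stages):
        for _ in range(swaps_per_stage):
            r1, r2 = rng.integers(h, size=2)
            c1, c2 = rng.integers(w, size=2)
            a, b = labels[r1, c1], labels[r2, c2]
            if a == b:
                continue               # exchanging equal labels changes nothing
            before = (local_energy(labels, data, r1, c1, a, beta)
                      + local_energy(labels, data, r2, c2, b, beta))
            after = (local_energy(labels, data, r1, c1, b, beta)
                     + local_energy(labels, data, r2, c2, a, beta))
            du = after - before
            if du < 0 or rng.random() < np.exp(-du / t):
                labels[r1, c1], labels[r2, c2] = b, a
        t *= rate                      # assumed multiplicative cooling
    return labels
```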



2.3 Experiments 

We consider the synthetic image shown in figure 1.a. Gaussian noise is added
to this image. We consider severe noise corresponding to a Signal to Noise Ratio 
(SNR) equal to —l.bdB (figure l.b) and —lldB (figure l.c). A low value of the 
prior distribution parameter $\beta$ leads to noisy results for both the
Metropolis (figures 2.a and 5.a) and the Kawasaki dynamic (figures 2.d and
5.d). However, the results obtained with the Kawasaki dynamic are closer to the
expected results. When increasing the value of $\beta$ the regularization is
increased. For high values of $\beta$ the segmentation tends to “forget” the
data and the prior distribution induces artifacts, especially for low SNR
(figures 5.c and 5.f). A good compromise must be found. For low SNR, such a
compromise cannot be found with the Metropolis dynamic. Imposing the class
ratios avoids the mixing between objects. We can notice that, in the Kawasaki
dynamic experiments, some isolated pixels are still misclassified. In fact, to
have a fair comparison we have set the same number of iterations for all
experiments. The convergence with the Kawasaki dynamic is slower than with the
Metropolis dynamic.

3 A New Bayesian Cost Function 

3.1 Algorithm 

From the previous section, it appears that fixing the ratios between the classes 
improves the segmentation especially in case of low signal to noise ratio. However, 
in practice these parameters are not known. Therefore we propose to estimate 
them in a first step. Then, we cannot use the Kawasaki dynamic as it stands 
because we only have some estimates of the class ratios. Some deviations of the 
class ratios from these estimates must be allowed. Therefore, we define a new
Bayesian cost function which penalizes configurations with class ratios different
from the estimates. However, this penalty is not “hard” in the sense that small
deviations from the initial estimates are not highly penalized.

The proposed criterion is defined by the following cost function:

$R_s(Y, Y^\circ) = 1 - \Delta(Y, Y^\circ)\, f(\|\gamma - \gamma_0\|)\, f(\|\gamma^\circ - \gamma_0\|)$, (13)

where $\gamma_0$ represents the estimated class ratios and $f(\cdot)$ is a decreasing function
on $[0, \infty)$ taking its values in $[0, 1]$ with maximum at 0. The Kawasaki dynamic
is obtained for $f(\|\gamma^\circ - \gamma_0\|) = \Delta(\gamma^\circ, \gamma_0)$.

To optimize the posterior distribution with respect to this criterion we have
to compute:

$\hat{Y} = \arg\min_{Y^\circ} \int R_s(Y^\circ, Y)\, P(Y \mid X)\, dY$, (14)

$= \arg\min_{Y^\circ} \int \left[1 - \Delta(Y, Y^\circ)\, f(\|\gamma^\circ - \gamma_0\|)\, f(\|\gamma - \gamma_0\|)\right] P(Y \mid X)\, dY$,

$= \arg\max_{Y^\circ} f^2(\|\gamma^\circ - \gamma_0\|)\, P(Y^\circ \mid X)$,

$= \arg\max_{Y^\circ} \frac{1}{Z} \exp\left[-U(X \mid Y^\circ) - U(Y^\circ) + \log f^2(\|\gamma^\circ - \gamma_0\|)\right]$,

$= \arg\max_{Y^\circ} \exp -\left[U(X \mid Y^\circ) + U(Y^\circ) - 2 \log f(\|\gamma^\circ - \gamma_0\|)\right]$.

Therefore, the defined criterion is equivalent to the MAP criterion using the
prior:

$P(Y^\circ) \propto \exp -\left[U(Y^\circ) - 2 \log f(\|\gamma^\circ - \gamma_0\|)\right]$. (15)

This new Bayesian cost function is equivalent to adding a prior which favors
the configurations having class ratios close to the estimated ones. To reach this
criterion, we can use a simulated annealing scheme with a Metropolis dynamic
on this new prior:

Algorithm 3 

1 Estimate the class ratios $\gamma_0$

2 Initialize a random configuration $Y = (y_s)$, set $T = T_0$

3 For each site $s$:

3.a Choose a random value $new$ different from the current value $cur$
3.b Compute the local energies $U(y_s = cur \mid \mathcal{N}_s)$ and $U(y_s = new \mid \mathcal{N}_s)$ and
the corresponding class ratios $\gamma_{cur}$ and $\gamma_{new}$
3.c If $U(y_s = new \mid \mathcal{N}_s) - 2 \log f(\|\gamma_{new} - \gamma_0\|) < U(y_s = cur \mid \mathcal{N}_s) - 2 \log f(\|\gamma_{cur} - \gamma_0\|)$,
set $y_s$ to $new$; otherwise set $y_s$ to $new$ with probability
$\exp[-\Delta U / T]$, where

$\Delta U = U(y_s = new \mid \mathcal{N}_s) - 2 \log f(\|\gamma_{new} - \gamma_0\|) - U(y_s = cur \mid \mathcal{N}_s) + 2 \log f(\|\gamma_{cur} - \gamma_0\|)$.

4 If the stopping criterion is not reached, decrease T and go to 3
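Step 3 of Algorithm 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the local energy is taken to be a Gaussian data term plus a Potts prior on the 4-neighbourhood, and the penalty uses the hypothetical choice $f(d) = \exp(-k d^2)$, which may differ from the function (16) used in the simulations. The names `local_energy`, `ratio_penalty`, `sweep` and the parameter `k` are introduced here for illustration only.

```python
import math
import random


def local_energy(x, y, i, j, c, mu, sigma, beta):
    """U(y_s = c | N_s): Gaussian data term plus a Potts prior (assumed model)."""
    data = (x[i][j] - mu[c]) ** 2 / (2 * sigma[c] ** 2) + math.log(sigma[c])
    disagree = 0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        a, b = i + di, j + dj
        if 0 <= a < len(y) and 0 <= b < len(y[0]) and y[a][b] != c:
            disagree += 1
    return data + beta * disagree


def ratio_penalty(counts, gamma0, k):
    """-2 log f(||gamma - gamma0||) for the assumed choice f(d) = exp(-k d^2)."""
    n = sum(counts)
    d2 = sum((cnt / n - g) ** 2 for cnt, g in zip(counts, gamma0))
    return 2.0 * k * d2


def sweep(x, y, gamma0, mu, sigma, beta, k, T, rng):
    """Step 3 of Algorithm 3: visit every site once; y is updated in place."""
    n_classes = len(mu)
    counts = [sum(row.count(c) for row in y) for c in range(n_classes)]
    for i in range(len(y)):
        for j in range(len(y[0])):
            cur = y[i][j]
            new = rng.choice([c for c in range(n_classes) if c != cur])
            new_counts = list(counts)
            new_counts[cur] -= 1
            new_counts[new] += 1
            dU = (local_energy(x, y, i, j, new, mu, sigma, beta)
                  + ratio_penalty(new_counts, gamma0, k)
                  - local_energy(x, y, i, j, cur, mu, sigma, beta)
                  - ratio_penalty(counts, gamma0, k))
            # Accept improvements always, other moves with probability exp(-dU / T).
            if dU < 0 or rng.random() < math.exp(-dU / T):
                y[i][j] = new
                counts = new_counts
    return y
```

A full annealing run would call `sweep` repeatedly while decreasing `T`. The class counts are updated incrementally so that the class-ratio term costs only a few operations per site.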

For the simulations, we have considered the following function: 

/(II 7 - 70 II) oc exp-kL_^. (16) 

In this paper, we have considered that the mean and variance of each class
are known. The different classes are supposed to be Gaussian. Therefore the
histogram of the data is written as follows:

$h(x) = \sum_i \gamma_0(i)\, \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp -\left[\frac{(x - \mu_i)^2}{2\sigma_i^2}\right]$. (17)

Denote the vectors $H = (h(j))$, $\Gamma = (\gamma_0(i))$ and the matrix $P = (p_{ij})$ where
$p_{ij} = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp -\left[\frac{(j - \mu_i)^2}{2\sigma_i^2}\right]$. Then we have:

$H = \Gamma P$. (18)

The class ratios are then estimated by a singular value decomposition computing:

$\Gamma = H P^{-1}$. (19)



Note that, if we do not know the mean and variance of the classes, we can fit a
sum of Gaussians to the histogram using the Levenberg-Marquardt method [11].
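The estimation of the class ratios from the histogram can be illustrated as follows. The text computes the solution with a singular value decomposition; the sketch below swaps in an ordinary least-squares solve through the normal equations, which gives the same answer when the Gaussian components are linearly independent. The function names are hypothetical, and the histogram is assumed to be normalized to a density sampled at the given gray levels.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting (a must be non-singular)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def estimate_class_ratios(hist, levels, mu, sigma):
    """Least-squares solution Gamma of H = Gamma P, with P[i][j] = pdf_i(levels[j])."""
    p = [[gaussian_pdf(x, m, s) for x in levels] for m, s in zip(mu, sigma)]
    gram = [[sum(a * b for a, b in zip(pi, pk)) for pk in p] for pi in p]
    rhs = [sum(a * h for a, h in zip(pi, hist)) for pi in p]
    return solve(gram, rhs)
```

With a noisy histogram the estimate is not guaranteed to be non-negative or to sum to one, which is one reason why some deviation from the estimated ratios has to be tolerated.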



3.2 Experiments 

For the first experiment (SNR = -7.5 dB), the modified Bayesian cost function
leads to results very similar to those obtained with the Kawasaki dynamic, for
both real (figures 3.a and 3.b) and estimated (figures 3.d and 3.e) class ratios.
Note that here the convergence is as fast as with the Metropolis dynamic
and we do not get misclassified isolated pixels. With an SNR equal to -11 dB,
small deviations from the class ratios produce some perturbations in the results
even with the real class ratios (figures 6.b and 6.c). However, the results are still
much better than with the original Metropolis dynamic.

4 Mixing the Metropolis and the Kawasaki Dynamics 

4.1 Algorithm 

Finally, we propose to mix the Metropolis and the Kawasaki dynamics. Using
a simulated annealing scheme, we alternate long sequences of iterations of the
Kawasaki dynamic with short sequences of the Metropolis dynamic. The Kawasaki
dynamic sequences allow us to optimize the a posteriori probability at constant
class ratios. The short Metropolis dynamic sequences provide some variations
from the estimated class ratios. The algorithm is written as follows:

Algorithm 4 

1 Estimate the class ratios $\gamma_0$

2 Initialize a random configuration $Y = (y_s)$ with $\gamma = \gamma_0$, set $T = T_0$

3 During $\alpha N$ iterations:

3.a Choose randomly two sites $s$ and $s'$, denote $cur_s$ (resp. $cur_{s'}$) the current
value of $y_s$ (resp. $y_{s'}$)
3.b Compute the local energies $U(y_s = cur_s \mid \mathcal{N}_s)$, $U(y_{s'} = cur_{s'} \mid \mathcal{N}_{s'})$,
$U(y_s = cur_{s'} \mid \mathcal{N}_s)$ and $U(y_{s'} = cur_s \mid \mathcal{N}_{s'})$
3.c If $U(y_s = cur_{s'} \mid \mathcal{N}_s) + U(y_{s'} = cur_s \mid \mathcal{N}_{s'}) < U(y_s = cur_s \mid \mathcal{N}_s) + U(y_{s'} = cur_{s'} \mid \mathcal{N}_{s'})$,
set $y_s$ to $cur_{s'}$ and $y_{s'}$ to $cur_s$; otherwise set $y_s$ to $cur_{s'}$ and
$y_{s'}$ to $cur_s$ with probability $\exp[-\Delta U / T]$, where

$\Delta U = U(y_s = cur_{s'} \mid \mathcal{N}_s) + U(y_{s'} = cur_s \mid \mathcal{N}_{s'}) - U(y_s = cur_s \mid \mathcal{N}_s) - U(y_{s'} = cur_{s'} \mid \mathcal{N}_{s'})$.

4 During $(1 - \alpha) N$ iterations:

4.a Choose randomly a site $s$ and a random value $new$ different from the
current value $cur$
4.b Compute the local energies $U(y_s = cur \mid \mathcal{N}_s)$ and $U(y_s = new \mid \mathcal{N}_s)$
4.c If $U(y_s = new \mid \mathcal{N}_s) < U(y_s = cur \mid \mathcal{N}_s)$, set $y_s$ to $new$; otherwise set $y_s$ to
$new$ with probability $\exp[-\Delta U / T]$, where

$\Delta U = U(y_s = new \mid \mathcal{N}_s) - U(y_s = cur \mid \mathcal{N}_s)$.

5 If the stopping criterion is not reached, decrease T and go to 3
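One temperature level of Algorithm 4 (steps 3 and 4) might look like the sketch below. The local energy is again an assumed Gaussian-plus-Potts model, and all function names are hypothetical. As in the algorithm, the energy change of an exchange is the sum of the two local energy changes, each computed with the current neighbourhood.

```python
import math
import random


def local_energy(x, y, i, j, c, mu, sigma, beta):
    """U(y_s = c | N_s): Gaussian data term plus a Potts prior (assumed model)."""
    data = (x[i][j] - mu[c]) ** 2 / (2 * sigma[c] ** 2) + math.log(sigma[c])
    disagree = 0
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        a, b = i + di, j + dj
        if 0 <= a < len(y) and 0 <= b < len(y[0]) and y[a][b] != c:
            disagree += 1
    return data + beta * disagree


def accept(dU, T, rng):
    """Acceptance rule shared by both dynamics."""
    return dU < 0 or rng.random() < math.exp(-dU / T)


def kawasaki_step(x, y, mu, sigma, beta, T, rng):
    """Step 3: propose exchanging the labels of two random sites."""
    h, w = len(y), len(y[0])
    i, j = rng.randrange(h), rng.randrange(w)
    a, b = rng.randrange(h), rng.randrange(w)
    cs, ct = y[i][j], y[a][b]
    if cs == ct:
        return  # exchanging equal labels changes nothing
    dU = (local_energy(x, y, i, j, ct, mu, sigma, beta)
          + local_energy(x, y, a, b, cs, mu, sigma, beta)
          - local_energy(x, y, i, j, cs, mu, sigma, beta)
          - local_energy(x, y, a, b, ct, mu, sigma, beta))
    if accept(dU, T, rng):
        y[i][j], y[a][b] = ct, cs


def metropolis_step(x, y, mu, sigma, beta, T, rng):
    """Step 4: propose a new label at one random site."""
    i, j = rng.randrange(len(y)), rng.randrange(len(y[0]))
    cur = y[i][j]
    new = rng.choice([c for c in range(len(mu)) if c != cur])
    dU = (local_energy(x, y, i, j, new, mu, sigma, beta)
          - local_energy(x, y, i, j, cur, mu, sigma, beta))
    if accept(dU, T, rng):
        y[i][j] = new


def mixed_iteration(x, y, mu, sigma, beta, T, alpha, n, rng):
    """Steps 3 and 4 at a fixed temperature T; y is updated in place."""
    n_kawasaki = int(round(alpha * n))
    for _ in range(n_kawasaki):
        kawasaki_step(x, y, mu, sigma, beta, T, rng)
    for _ in range(n - n_kawasaki):
        metropolis_step(x, y, mu, sigma, beta, T, rng)
    return y
```

Because an exchange only permutes labels, the Kawasaki phase leaves the class ratios untouched; only the Metropolis phase can move them away from $\gamma_0$.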



4.2 Experiments 

The mixing between the Metropolis and the Kawasaki dynamics leads to results
similar to those obtained with the new Bayesian cost function defined in the
previous section, especially for an SNR equal to -7.5 dB (figures 4.b, 4.c, 4.e
and 4.f). For an SNR equal to -11 dB, the results are better with the real class
ratios (figures 7.b and 7.c) but they are less robust with respect to the bias in
the class ratio estimation (figures 7.e and 7.f).

5 Conclusion 

Markov Random Fields (Gibbs Fields) are widely used in image segmentation.
Using a Bayesian framework allows us to include prior information on the
expected result, which in turn regularizes the solution. However, in case of severe
noise or low SNRs, these models show their limits. One field of investigation is
to define more efficient prior distributions [12,9,13]. In this paper, we have
investigated another aspect of the problem. Once the model is defined, we have to
define a Bayesian criterion and a dynamic to optimize the posterior distribution
with respect to this criterion. The most used criteria are the MAP and the MPM,
which can be reached using a Metropolis dynamic. We have shown the limits of the
Potts model with this dynamic and the improvement which can be expected
by using a Kawasaki dynamic. This dynamic considers configurations with given
class ratios. Because the class ratios are not known in a practical application,
we have to use an estimate of these quantities. Therefore, we have proposed
some adaptations of the Kawasaki dynamic to tackle the problem of the bias in
the class ratio estimation. The defined Bayesian cost function appears to be a
tool to incorporate some prior knowledge on the expected result.

The first approach consists in generalizing the Bayesian cost function
associated with the Kawasaki dynamic. The derivation of an optimal cost function
is an open problem. The second approach consists in mixing the Metropolis and
the Kawasaki dynamics. Here also, the optimal mixing is an open problem.
Computing a MAP solution, i.e. using a Glauber dynamic, either for the Potts
model or for the modified prior, requires 20 min on an UltraSO station. For
a pure Kawasaki dynamic the CPU time becomes 80 min, whereas it is 30 min for
the mixing between Glauber and Kawasaki dynamics. Moreover, we can largely
reduce the CPU time required for the Glauber dynamic if we scan the image
instead of randomly selecting pixels. From this point of view, the new
prior proposed here is more attractive than the mixture between Kawasaki and
Glauber dynamics.

We have shown the relevance of these approaches on synthetic data. We are
currently investigating some applications on real data such as SAR or sonar
images, which are characterized by strong noise. Besides, this work motivates
research on accurate estimators of the class ratios. We are currently
investigating the EM algorithm.
Fig. 2. Results with the Metropolis dynamic (a,b,c) and the Kawasaki dynamic
(d,e,f) on figure 1.b (SNR = -7.5 dB). Panels a, d: $\beta$ = 0.2; b, e: $\beta$ = 0.6;
c, f: $\beta$ = 1.0.

Fig. 3. Results with the new Bayesian cost function with the real class ratios
(a,b,c) and estimated class ratios (d,e,f) on figure 1.b (SNR = -7.5 dB).
Panels a, d: $\beta$ = 0.2; b, e: $\beta$ = 0.6; c, f: $\beta$ = 1.0.

Fig. 4. Results by mixing Metropolis and Kawasaki dynamics with the real
class ratios (a,b,c) and estimated class ratios (d,e,f) on figure 1.b
(SNR = -7.5 dB).

Fig. 5. Results with the Metropolis dynamic (a,b,c) and the Kawasaki dynamic
(d,e,f) on figure 1.c (SNR = -11 dB). Panels a, d: $\beta$ = 0.2; b, e: $\beta$ = 0.6;
c, f: $\beta$ = 1.0.

Fig. 6. Results with the new Bayesian cost function with the real class ratios
(a,b,c) and estimated class ratios (d,e,f) on figure 1.c (SNR = -11 dB).
Panels d: $\beta$ = 0.2; e: $\beta$ = 0.6; f: $\beta$ = 1.0.

Fig. 7. Results by mixing Metropolis and Kawasaki dynamics with the real
class ratios (a,b,c) and estimated class ratios (d,e,f) on figure 1.c (SNR = -11 dB).
Abstract. Satellite images can be corrupted by an optical blur and
electronic noise. Blurring is modeled by convolution, with a known linear
operator H, and the noise is supposed to be additive, white and Gaussian,
with a known variance. The recovery problem is ill-posed and therefore
must be regularized. Herein, we use a regularization model which
introduces a $\varphi$-function, avoiding noise amplification while preserving image
discontinuities (i.e. edges) of the restored image. This model involves
two hyperparameters. Our goal is to estimate the optimal parameters in
order to reconstruct images automatically.

In this paper, we propose to use the Maximum Likelihood estimator,
applied to the observed image. To evaluate the derivatives of this
criterion, we must estimate expectations by sampling (samples are extracted
from a Markov chain). These samples are images whose probability takes
into account the convolution operator. Thus, it is very difficult to obtain
them directly by using a standard sampler. We have developed a new
algorithm for sampling, using an auxiliary variable based on the Geman-Yang
algorithm, and a cosine transform. We also present a new reconstruction
method based on this sampling algorithm. We detail the Markov
Chain Monte Carlo Maximum Likelihood (MCMCML) algorithm, which
makes it possible to simultaneously estimate the parameters and reconstruct
the corrupted image.



1 Introduction 

The problem of reconstructing an image X from noisy and blurred data Y is
ill-posed in the sense of Hadamard. Knowing the degradation model is not
sufficient to obtain satisfying results: it is necessary to regularize the solution
by introducing a priori constraints. Y is defined by $Y = HX + N$. When
the blur operator H (corresponding to the Point Spread Function h) and the
variance $\sigma^2$ of the Gaussian additive noise N are known, the Maximum
Likelihood (ML) estimate of X consists in searching for the image which minimizes
the energy $U_0(X) = \|Y - HX\|^2 / 2\sigma^2$. The regularization constraint is expressed

* This work has been conducted in relation with the GdR ISIS (CNRS).

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 115 ^!^ 1999. 

© Springer- Verlag Berlin Heidelberg 1999 
through a function $\Phi(X)$ added to $U_0$, which represents a roughness penalty on
X. This function could be quadratic (as suggested by Tikhonov), assuming
that images are globally smooth, but it yields oversmooth solutions. A
more efficient image model assumes that only homogeneous regions are smooth,
and edges must remain sharp. To get this edge-preserving regularization, we use
a non-quadratic $\varphi$-function, as introduced in earlier work. Properties of the
$\varphi$-function have been studied in a variational approach in order to preserve the
edges while avoiding noise amplification. We use a convex $\varphi$-function, ensuring the
uniqueness of the solution, so that restoration is made by a deterministic
minimization algorithm.

The non-quadratic variational regularization model involves two hyperpa- 
rameters. The smooth properties of reconstructed images depend strongly on 
their value and therefore they must be accurately determined, excluding an em- 
pirical estimation. We use a stochastic approach, based on the Maximum Like- 
lihood estimator. The regularizing model is interpreted as a Markov Random 
Field (MRF). There are two difficulties with the estimation: 

— The probability to be maximized is non concave. 

— To compute the derivative of the likelihood function according to parameters, 
we need to sample from the prior and posterior image densities. Due to the 
convolution, it is impossible to directly sample images from the posterior 
law. 

We propose in this paper to use a stochastic method to estimate the parameters,
so we first make a statistical interpretation of the regularization constraint, in a
Bayesian framework. Then we explain why we choose the Maximum Likelihood
estimator, and we focus on its gradient computation. We use a Monte Carlo
method, which needs to sample images from the prior and posterior laws.
Sampling is achieved by transforming these laws into Gaussian laws: this is possible
by using a half-quadratic extension of the $\varphi$-function, as in Geman & Yang,
and by working in the cosine transform (DCT) space. We study some problems
raised by sampling. Then we detail the MCMCML (Markov Chain Monte Carlo
Maximum Likelihood) estimation algorithm, and study its convergence.

2 Problem statement 

In this paper, X, Y and N are vectors of dimension $N_x N_y$, made from the
pixels of an image in lexicographic order. We denote $X_{ij}$ the value of the pixel
at column i and line j.

The degradation model is represented by the equation $Y = HX + N$, where
Y is the observed data, and X the original image. N is the additive noise and is
supposed to be Gaussian, white and stationary. H is the convolution operator.
The PSF h is positive, symmetric with respect to lines and columns, and verifies
the Shannon property. H is a block-circulant matrix generated by h.

The noise standard deviation a and the PSF h are known (see figEJ. 
The regularized solution is computed by minimizing the energy

$U(X) = \|Y - HX\|^2 / 2\sigma^2 + \Phi(X)$ (1)

where $\Phi$ is defined by $\Phi(X) = \lambda^2 \sum_{ij} \left[ \varphi\left((D_x X)_{ij} / \delta\right) + \varphi\left((D_y X)_{ij} / \delta\right) \right]$.

$\lambda$ and $\delta$ are the hyperparameters. $D_x$ and $D_y$ are first derivative discrete operators
on columns and lines. $\varphi$ is called a potential function, which has to exhibit
boundary-preserving properties. It is a symmetric, positive and increasing function,
with a quadratic behaviour near 0 to smooth isotropically homogeneous areas
and linear or sub-linear behaviour near $\infty$ to preserve high gradients (edges).
We use the “hypersurfaces” convex function $\varphi(u) = 2\sqrt{1 + u^2} - 2$.
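The energy can be evaluated directly for a small image, as in the sketch below. It assumes the regularization term $\Phi(X) = \lambda^2 \sum_{ij} [\varphi((D_x X)_{ij}/\delta) + \varphi((D_y X)_{ij}/\delta)]$ with the hypersurface potential $\varphi(u) = 2\sqrt{1+u^2} - 2$, forward differences that vanish across the image border, and a caller-supplied `blur` function standing in for the operator H. These names and the boundary handling are illustrative choices, not taken from the paper.

```python
import math


def phi(u):
    """Hypersurface potential phi(u) = 2 sqrt(1 + u^2) - 2."""
    return 2.0 * math.sqrt(1.0 + u * u) - 2.0


def regularization(x, lam, delta):
    """Phi(X) = lam^2 * sum_ij [phi((D_x X)_ij / delta) + phi((D_y X)_ij / delta)]."""
    rows, cols = len(x), len(x[0])
    total = 0.0
    for j in range(rows):
        for i in range(cols):
            # Forward differences; the difference across the border is taken as 0.
            dx = x[j][i + 1] - x[j][i] if i + 1 < cols else 0.0
            dy = x[j + 1][i] - x[j][i] if j + 1 < rows else 0.0
            total += phi(dx / delta) + phi(dy / delta)
    return lam ** 2 * total


def energy(x, y, blur, sigma, lam, delta):
    """U(X) = ||Y - HX||^2 / (2 sigma^2) + Phi(X); blur(x) plays the role of HX."""
    hx = blur(x)
    data = sum((yv - hv) ** 2 for yr, hr in zip(y, hx) for yv, hv in zip(yr, hr))
    return data / (2.0 * sigma ** 2) + regularization(x, lam, delta)
```

For small arguments `phi(u)` behaves like $u^2$, and for large ones like $2|u|$, which is the quadratic-then-linear behaviour required of an edge-preserving potential.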




2.1 Hyperparameter choice 

The quality of the reconstructed solution depends on the hyperparameter choice. 
$\lambda$ is a parameter which weights the regularization term versus the data term. Too
high values of $\lambda$ yield oversmooth solutions, and too small values lead to noisy
images. The $\delta$ parameter is related to a threshold below which the gradients (due
to the noise) are smoothed, and above which they are preserved. A high value of
this threshold filters the edges as well as the noise, which yields over-regularized
images. On the other hand, a small $\delta$ value provides insufficient noise filtering.

An empirical hyperparameter choice is very difficult, because there are two 
degrees of freedom. That is why we propose to do an automatic estimation to 
choose the optimal hyperparameters. 



2.2 Bayesian interpretation of restoration problems 

We propose to use a statistical estimator to choose the optimal hyperparameters.
So we need a stochastic interpretation of the criterion defined by equation (1).
In a Bayesian framework, the posterior probability of an image X observing the
data Y is given by

$P(X \mid Y) = P(Y \mid X)\, P(X) / P(Y)$ (2)

where P(Y) is a constant w.r.t. X. In this paper we write P(X) instead of
P(x = X), where x denotes the multidimensional ($N_x N_y$) random variable. To
evaluate equation (2), we first have to express the probability of obtaining a
corrupted data Y from an original image X. $P(Y \mid X)$ follows the distribution
of the Gaussian noise, independent of X:



$P(Y \mid X) = \frac{1}{K_\sigma}\, e^{-\|Y - HX\|^2 / 2\sigma^2}$ (3)

where $K_\sigma$ is a normalizing constant. $P(Y \mid X)$ is called the likelihood of X.

P{X) is the prior probability, i.e. the prior knowledge of the reconstructed 
solution, which defines the regularization model. 
X follows a Gibbsian distribution:

$P(X) = \frac{1}{Z}\, e^{-\Phi(X)}$ (4)

Then X is a Markov Random Field (MRF), according to the
Hammersley-Clifford theorem.

Finally, the posterior probability is

$P(X \mid Y) = \frac{1}{Z_Y}\, e^{-U(X)}$ (5)

Z and $Z_Y$ are normalizing terms (partition functions), which depend on the
hyperparameters $(\lambda, \delta)$:

$Z = \int_\Omega e^{-\Phi(X)}\, dX$ and $Z_Y = \int_\Omega e^{-U(X)}\, dX$

$\Omega$ is the state space (set of $N_x \times N_y$ size images with real pixels, hence
$\Omega = [-m, m]^{N_x N_y}$ where m is a fixed bound). These integrals are
well-defined because the state space is bounded.

Minimizing U(X) in (1) is equivalent to maximizing the posterior probability
in (5) (MAP criterion). Therefore, the optimization can be done by either
deterministic or stochastic algorithms. The former usually need the uniqueness of
the energy minimizer, but the latter (like simulated annealing [22]) work better
with multiple energy minima. The considered energy is convex so we use a
deterministic algorithm.

3 Hyperparameter estimation 

3.1 Introduction 

Hyperparameter estimation is a difficult problem, but it is needed when using a
parametric restoration algorithm, whose results strongly depend on the
hyperparameter values.

We first present a few estimators generally used for parameter estimation.
We then explain why we prefer to use the last one, although it is very hard to
implement.

If we consider the prior probability $P(X \mid \lambda, \delta)$, the ML estimator on $\lambda$ and
$\delta$ can be expressed as

$(\hat\lambda, \hat\delta) = \arg\max_{\lambda, \delta} P(X \mid \lambda, \delta) = \arg\max_{\lambda, \delta} \frac{1}{Z}\, e^{-\Phi(X)}$ (6)

Hyperparameters estimated this way make sense w.r.t. the complete data case, 
i.e. not corrupted by blur and noise. This is the case for a classification problem, 
as seen in 0 or 
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For image reconstruction, hyperparameters depend on the type of degrada- 
tion (variance of noise and size of H). This dependence is impossible if they are 
estimated from the original image (supposing that we know it). 

Lakshmanan and Derin proposed to use the following criterion:

$(\hat{X}, \hat\lambda, \hat\delta) = \arg\max_{X, \lambda, \delta} P(X, Y \mid \lambda, \delta)$ (7)

which involves the joint law of X and Y. To get the optimum the authors
proposed to use the Generalized Maximum Likelihood (GML) algorithm. This
criterion allows the use of an approximate optimization technique, consisting of
alternate maximizations on X and $(\lambda, \delta)$. This method is suboptimal, but is fast and
simplifies the problem. More recently, the same type of criterion has been used
by other authors. It is equivalent to the MAP estimate for X when $\lambda$ and $\delta$ are
fixed, and the parameters reduce to the ML estimator in the complete data case.
In fact, the ML estimator is applied to the solution reconstructed with the
current $(\lambda, \delta)$. The convergence of this algorithm is not guaranteed and degenerate
solutions can be found. As in the previous case, sampling is only needed from
the prior model, without respect to the data. Therefore, this estimator does not
seem to be appropriate to our problem, which is why we have chosen the following
criterion.

3.2 The Maximum Likelihood Estimator 

L. Younes proposed to use the ML estimator computed with the probability
of the observed data knowing the $\lambda$ and $\delta$ parameters:

$(\hat\lambda, \hat\delta) = \arg\max_{\lambda, \delta} P(Y \mid \lambda, \delta)$ (8)

As Y is a fixed observation, this probability depends only on the
hyperparameters. To calculate (8), the joint distribution P(X, Y) is integrated over X, then
Bayes law is used to reduce to prior and posterior distributions. We finally
obtain:

$P(Y \mid \lambda, \delta) = \int_\Omega P(Y \mid X, \lambda, \delta)\, P(X \mid \lambda, \delta)\, dX$ (9)

which leads to $P(Y \mid \lambda, \delta) = Z_Y / (Z K_\sigma)$, where Z and $Z_Y$ are respectively the
partition functions related to the prior and posterior distributions (see section
2.2), and $K_\sigma$ is the normalizing constant of the likelihood (3).

The main difficulty of parameter estimation comes from these partition
functions, which depend on $(\lambda, \delta)$ but are impossible to evaluate. So we optimize this
criterion without explicitly computing Z and $Z_Y$, as we only compute its
derivatives, by using a gradient descent algorithm.

The criterion is redefined as $J(\lambda, \delta) = -\log P(Y \mid \lambda, \delta)$, which yields

$J(\lambda, \delta) = \log Z(\lambda, \delta) - \log Z_Y(\lambda, \delta) + \text{const.}$ (10)

where the constant (w.r.t. $\lambda$ and $\delta$) is $\log K_\sigma$.
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3.3 Gradient estimation 

To optimize the log-likelihood m, we should evaluate its derivatives w.r.t. the 
hyperparameters. Refer to m for more detailed calculus. These calculus are 
based on the following property : 



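Property 1 (Log Z derivative)

Let Z denote the partition function related to the prior distribution
$P(X \mid \lambda, \delta) = \frac{1}{Z}\, e^{-\Phi(X)}$. We have

$\frac{\partial \log Z}{\partial \theta_i} = -E_{X \sim P(X \mid \theta)}\left[\frac{\partial \Phi(X)}{\partial \theta_i}\right]$ (11)

where $\theta = (\lambda, \delta)$ and E[] is the expectation of X w.r.t. $P(X \mid \theta)$.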
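The identity behind Property 1, $\partial \log Z / \partial \theta_i = -E[\partial \Phi / \partial \theta_i]$, can be checked numerically on a toy problem. The sketch below is not the image model: it assumes a single real "pixel" on a bounded interval $[-m, m]$ with $\Phi(x) = \lambda^2 \varphi(x/\delta)$, so that $\partial\Phi/\partial\lambda = 2\lambda\varphi(x/\delta)$, and compares a finite-difference derivative of $\log Z$ with the expectation. All function names are hypothetical.

```python
import math


def phi(u):
    """Hypersurface potential phi(u) = 2 sqrt(1 + u^2) - 2."""
    return 2.0 * math.sqrt(1.0 + u * u) - 2.0


def grid(m, n):
    """Midpoints and width of n equal cells covering [-m, m]."""
    step = 2.0 * m / n
    return [-m + (k + 0.5) * step for k in range(n)], step


def log_z(lam, delta, m=50.0, n=4000):
    """log Z, with Z the integral of exp(-lam^2 phi(x / delta)) over [-m, m] (midpoint rule)."""
    xs, step = grid(m, n)
    return math.log(step * sum(math.exp(-lam ** 2 * phi(x / delta)) for x in xs))


def expected_dphi_dlam(lam, delta, m=50.0, n=4000):
    """E[dPhi/dlam] under P(x) = exp(-Phi(x)) / Z, with Phi(x) = lam^2 phi(x / delta)."""
    xs, _ = grid(m, n)
    weights = [math.exp(-lam ** 2 * phi(x / delta)) for x in xs]
    derivs = [2.0 * lam * phi(x / delta) for x in xs]
    return sum(w * d for w, d in zip(weights, derivs)) / sum(weights)
```

In the image setting the state space is far too large for quadrature, which is why the expectation has to be replaced by an empirical mean over sampled images.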

All the derivative evaluations are done by estimation of expectations, which
can be approximated by the empirical mean estimate

$E_{X \sim P(X)}\left[\frac{\partial \Phi(X)}{\partial \theta_i}\right] \approx \frac{1}{N} \sum_{k=1}^{N} \frac{\partial \Phi(X^k)}{\partial \theta_i}$ (12)

where $X^k$ is the k-th vector of the chain $(X^k)$ and is sampled from the law
$P(X \mid \lambda, \delta)$. N is the number of samples $X^k$. The first derivative of the criterion
(10), for each component of $\theta$, is given by

$\frac{\partial J}{\partial \theta_i} = E_Y\left[\frac{\partial \Phi(X)}{\partial \theta_i}\right] - E\left[\frac{\partial \Phi(X)}{\partial \theta_i}\right]$ (13)

which exhibits two types of expectation:

• E[]: expectation for the prior law $X \sim P(X) = \frac{1}{Z}\, e^{-\Phi(X)}$,
without respect to the data (section 3.4);

• $E_Y$[]: expectation for the posterior law $X \sim P_Y(X) = \frac{1}{Z_Y}\, e^{-U(X)}$,
with respect to the data (section 3.4).

In order to use a gradient-type descent algorithm, we need to compute (13) for
various values of $\theta$. So we need to sample from prior and posterior distributions
to minimize the -log-likelihood. In the following section, we present an efficient
method to do this.



3.4 The sampling problem 

Sampling from the posterior density Sampling from the posterior density
is intractable by means of classical algorithms such as the Gibbs sampler or
Metropolis, due to the large support of the PSF, which induces a large
neighborhood for the conditional probability. We use the idea introduced by Geman &
Yang to derive a simulated annealing algorithm for MAP estimation of X.
Classical methods generally use an integer bounded space for pixels, and a finite
state space. Our algorithm works on real bounded pixels, which is compatible
with the reconstruction algorithm. Some samples are presented in figure 1.

The idea is to diagonalize the convolution operator, which is equivalent to
replacing the convolution by a point-to-point multiplication in the frequency
space. The difficulty is to diagonalize at the same time the non-quadratic
regularization term. That is made possible by using the half-quadratic form of
$\varphi$.
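The diagonalization argument rests on the convolution theorem: a circular convolution becomes a point-to-point multiplication after a discrete Fourier transform. The one-dimensional sketch below illustrates this with a naive DFT written with the standard library only; it covers the periodic (circulant) case, not the DCT with symmetric boundary conditions, and the function names are hypothetical.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform of a sequence (O(n^2), for illustration)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [z / n for z in out] if inverse else out


def circular_convolve(h, x):
    """Direct circular convolution: (h * x)[m] = sum_k h[(m - k) mod n] x[k]."""
    n = len(x)
    return [sum(h[(m - k) % n] * x[k] for k in range(n)) for m in range(n)]


def convolve_via_dft(h, x):
    """Same result obtained by multiplying the two spectra point by point."""
    hf, xf = dft(h), dft(x)
    product = [a * b for a, b in zip(hf, xf)]
    return [z.real for z in dft(product, inverse=True)]
```

The values `dft(h)` are the eigenvalues of the circulant matrix generated by `h`, which is why a Gaussian law whose covariance is built from such matrices can be sampled coefficient by coefficient in the frequency domain.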



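Two auxiliary variable fields $B^x$ and $B^y$ (vectors of the same size as X) are
defined, so that $\Phi(X) = \min_{B^x, B^y} \Phi^*(X, B^x, B^y)$, with

$\Phi^*(X, B^x, B^y) = \lambda^2 \sum_{ij} \left[ \left(b^x_{ij} - \frac{(D_x X)_{ij}}{\delta}\right)^2 + \psi(b^x_{ij}) + \left(b^y_{ij} - \frac{(D_y X)_{ij}}{\delta}\right)^2 + \psi(b^y_{ij}) \right]$ (14)

This half-quadratic expansion has been used for image recovery, by
alternate minimizations w.r.t. X and $(B^x, B^y)$ in order to compute the regularized
solution, defining a convergent algorithm.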

We assume that P(X) is the marginal of $P(X, B^x, B^y)$:

$\int_{\Omega_B \times \Omega_B} P(X, B^x, B^y)\, dB^x\, dB^y = P(X)$ where $\Omega_B = \mathbb{R}^{N_x N_y}$ (15)

In fact, P(X) defined by (15) does not correspond to the P(X) previously defined
in (1) and (4) with the $\varphi$-function, but is an approximation involving a function
$\tilde\varphi$. Indeed, $\varphi(u)$ is the minimum of $(b - u)^2 + \psi(b)$ over b, but there is no
reason for P(X) to be the marginal of a joint density of X, $B^x$, $B^y$ built with the
half-quadratic expansion of $\varphi$.



/ -\-oo 
-oo 

This is always possible because the integral is positive, and it is convergent due 
to the linear behaviour of at -hoo and to the symmetry of fj. We have computed 
a graph of $\tilde\varphi$ by a numerical method. There is so little difference between the
graphs of $\varphi$ and $\tilde\varphi$ that in practice we can use $\varphi$ in the reconstruction algorithm,
even if the estimated parameters correspond to a $\tilde\varphi$ regularizing function.

$P(X, B^x, B^y) \propto e^{-\Phi^*(X, B^x, B^y)}$ (16)

This allows $\Phi^*$ to be quadratic w.r.t. X if $B^x$ and $B^y$ are fixed, and the
distribution to be “half-Gaussian” in the sense that the law of X given $B^x$, $B^y$
is Gaussian, and the variance-covariance matrix is diagonalized by a Fourier
transform. Conversely, $B^x$ and $B^y$ are conditionally independent given
X, so they are sampled in a single pass.

We modify the original method by using a DCT, which allows us to
fulfill the symmetric boundary conditions, so we avoid producing artefacts on
the borders of the image. This is equivalent to using images 4 times larger,
symmetrized w.r.t. rows and columns, but we process real pixels, unlike the
original algorithm. Due to the symmetry in the frequency domain the number
of processed pixels remains constant ($N_x N_y$).

Let us denote the {2Nx)‘^ {2Ny)‘^ block-circulant matrix obtained by ex- 
tension of H. This is the convolution operator applied on the 2N^ x 2Ny sym- 
metrized images, is the generator of composed of the first column of 
The Fourier transform of gives the eigenvalues of In the same 

way we define and generators of and derivative operators on the 
symmetrized images. 

Algorithm 1 (Posterior modified Geman-Yang) 

We first compute fF[h^], DCT[Y] and W defined by 

W = {^\r[hY + ^ {\m\? + \rW))~\ 

where T stands for the Fourier transform. 

Set $X^0 = \hat{X}$ (reconstructed image with initial parameters).

To obtain $X^{n+1}$ from $X^n$, repeat until the convergence criterion is satisfied:

• Sample $B^x$, $B^y$ w.r.t. the gradients of $X^n$, using the law
$P_g(b) \propto \exp\left(\lambda^2 \left[b\,(2g/\delta - b) - \psi(b)\right]\right)$;

• Generate an R image, whose pixels follow the Gaussian law N(0, 1/2);

• Sample X^~^^ in the frequency space : 

= DCT - 1 { VF (y DCT [DlB^ +D* B?'] + ^Py]DCT[Y]J + VW 

An image reconstruction scheme (estimation of X knowing Y, $\lambda$, $\delta$) has been
derived from this sampler; its advantages (compared to classical ones)
are both its speed and its respect of the symmetric boundary conditions.



Sampling from the prior model The classical Gibbs or Metropolis samplers 
could be used to sample from the prior model. But we prefer to use a prior 
modified Geman-Yang sampler, derived from the previous posterior sampler, to 
improve the computing time of our estimation method. 

Let us consider in the same manner the augmented stochastic process 
P{X, , Ry). To sample from this distribution, we must modify the posterior 
algorithm n in the following way (because we do not take into account the ob- 
served data anymore) : replace W by Wo = f^e 



sample is given by DGT“^ ^ Wo DGT [D^B^ + -b R • 

The differences between Gibbs and Geman-Yang prior samples can be seen
in figure 2: the latter are more spatially varying, because they are sampled in
the Fourier space. The number of iterations is the same for both samplers (see
the discussion of convergence studies below).
Fig. 1. 64x64 samples of the posterior law (5). Panels: $\lambda$ = 0.3, $\delta$ = 30 and
$\lambda$ = 0.05, $\delta$ = 30.

Fig. 2. (A) Gibbs and (B) modified Geman-Yang prior samples, computed with 20
iterations. Panels: $\lambda$ = 1, $\delta$ = 15 and $\lambda$ = 0.3, $\delta$ = 10.
Convergence studies Some theoretical studies concerning Markov chain
convergence can be found in the literature. We essentially retain that the
possibility of transition between two states of the $\Omega$ state space is a sufficient
convergence condition. For the augmented stochastic process $P(X, B^x, B^y)$ we ensure
that for every realization of the random fields X and $(B^x, B^y)$ the transition
probabilities $P(X \mid B^x, B^y)$ and $P(B^x, B^y \mid X)$ are strictly positive. The
modified Geman and Yang methods we present herein are correct sampling methods,
because $P(X, B^x, B^y)$ is an invariant measure as regards these two transitions.
Indeed,

$\forall X, X', B, B' \in \Omega$:

$P(X, B)\, P(X' \mid B) = P(X', B)\, P(X \mid B)$ (17)

$P(B, X)\, P(B' \mid X) = P(B', X)\, P(B \mid X)$

To evaluate the speed of convergence of the algorithms we can only do
experimental studies, because the $\Omega$ state space is not finite. The M first samples of
the Markov chain must be discarded, therefore a criterion must be chosen to
determine M. Several methods exist: some of them are based on
the norm between the current and equilibrium distributions; others use the
rejection rate of the Metropolis sampler. We prefer a faster method, which
consists in using a plot of the energy $\Phi(X)$, and a preliminary study to determine
the number of iterations which will be used during the estimation step.

A comparative convergence speed study has been made to show the
efficiency of our prior sampling method versus classical samplers. To do this, we
estimate the energy of prior samples generated by Metropolis, by Gibbs and by
the modified Geman-Yang sampler. Then we look at how this energy varies
with the number of iterations, and we determine the number of iterations which
is necessary to reach the equilibrium distribution. Although the state spaces of
the samplers are not the same (real pixels for Geman-Yang and integer pixels
for Gibbs and Metropolis), inducing a small difference between the generated
images, we conclude that the modified Geman-Yang sampler is more interesting
than the two others. We found that the energy needs the same number of iterations
(from 5 to 10) to reach a stable value, but the speed of each iteration depends
strongly on the chosen method. There is a small difference between Gibbs and
Geman-Yang samples (see fig. 2 for an illustration), the Geman-Yang ones showing
larger structures than the others. This difference has been detected over a large
set of samples, and seems to disappear after a few hundred iterations, so
the convergence seems to be slower for the Gibbs sampler. Classical samplers
successively explore the pixels; our method is much faster, because the pixels
are processed simultaneously in the frequency space, which enables long distance
interactions at once.

We also studied the variation of the convergence speed with the hyperparameter
value, and concluded that a phase transition exists. As illustrated in
figure 3, the energy of samples shows a great variation around $\lambda = \lambda_c$ for a fixed
$\delta$ and a nonconvex $\varphi$-function, which changes the aspect of the generated images.
In the transition area, we remark that 10 more iterations are necessary to ensure
convergence of the chain. It means the sampler oscillates between two kinds of
states (constant for $\lambda > \lambda_c$ and noisy for $\lambda < \lambda_c$), before reaching an equilibrium
state. For a nonconvex $\varphi$-function, the energy plot exhibits a discontinuity at
$\lambda_c$, which is not the case for convex functions (like the one we use).





Fig. 3. Gibbs samples and plot of their energy divided by $\lambda^2$,
$\delta = 5$, with a nonconvex Geman & McClure $\varphi$-function



The main convergence difficulty is due to the initialization. In practice, the
speed of convergence depends strongly on the initial image. The best choice for
the prior sampler is a constant image, whereas a random initialization needs
thousands of iterations to produce the same result. Indeed, prior samples are
generally small fluctuations around this constant. The best choice therefore
seems to be the maximizer of the sampled probability. In the same manner, the
posterior sampler must be initialized with $\hat{X}$, the image reconstructed
with the hyperparameters $\lambda$ and $\delta$. As seen in fig. 0, posterior
samples are fluctuations around $\hat{X}$.



3.5 MCMCML algorithm 

To optimize the likelihood criterion we use a descent method.

Algorithm 2 (MCMCML)

• Initialization: the ratio $\lambda/\delta$ corresponds to the best Wiener
filter and $\delta = 7\sigma$.

• Compute $\hat{X}$: $\hat{X}$ is reconstructed from $Y$ with the current
couple $\theta_n = (\lambda_n, \delta_n)$.

• Compute $E[\,]$ and $E_Y[\,]$ with $\theta_n$: generate 2 Markov chains with
the modified Geman-Yang method, one sampled from the prior model and the other
sampled from the posterior distribution, and compute the respective empirical
means.

• Iteration from $\theta_n$ to $\theta_{n+1}$: a Newton-Raphson step on the
criterion derivative, scaled by $\alpha$, where $\alpha < 1$ and
$\Delta\theta$ is a step used to compute the second derivative.

• Stopping criterion: we stop the algorithm when the convergence measure falls
below the threshold $\epsilon$.

Estimation and restoration are carried out simultaneously, as the modified
Geman-Yang sampler is initialized with $\hat{X}$.

Convergence and accuracy  If we compute the Hessian matrix, we see that it
depends on the data and the hyperparameters, so that the criterion is not
concave w.r.t. $\theta$ in general. As experimental studies have shown, the
criterion is locally quadratic w.r.t. $\theta$ close to the optimum. This makes
it possible to use a Newton-Raphson descent method to optimize $\theta$. The
second derivative is needed, and we evaluate it empirically. To ensure
convergence we must choose $\alpha$ smaller than 1 if $\theta$ is far from the
optimum.
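
As an illustration only, the following Python sketch shows a damped
Newton-Raphson iteration of the kind described above, for a single scalar
hyperparameter. It is not the authors' implementation: the function `grad` is a
hypothetical stand-in for the Monte Carlo estimate of the criterion derivative,
and the forward finite difference and the relative stopping rule are
assumptions made for the example.

```python
def damped_newton(grad, theta0, alpha=0.5, dtheta=1e-3, eps=1e-6, max_iter=100):
    """Find a zero of `grad` with Newton-Raphson steps scaled by alpha < 1.

    The second derivative is approximated by a forward finite difference
    with step `dtheta`. All names and defaults are illustrative assumptions.
    """
    theta = theta0
    for n in range(max_iter):
        g = grad(theta)
        second = (grad(theta + dtheta) - g) / dtheta  # empirical second derivative
        if second == 0.0:
            raise ZeroDivisionError("flat second derivative; cannot take a Newton step")
        step = alpha * g / second
        theta -= step
        # Assumed stopping rule: relative size of the last step below eps.
        if abs(step) < eps * max(abs(theta), 1.0):
            return theta, n + 1
    return theta, max_iter


# Toy criterion J(theta) = (theta - 2)^2 + 1, whose derivative is 2 * (theta - 2).
theta_hat, n_iter = damped_newton(lambda t: 2.0 * (t - 2.0), theta0=10.0)
```

In the setting of the paper the derivative would be a noisy difference of
empirical means, so the finite-difference step and the damping factor would
have to be chosen with that noise in mind.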

The estimation accuracy depends only on the threshold $\epsilon$ and on the
accuracy of the expectations, which is determined by the size and the number of
the samples used to compute the means. An $\epsilon$ value corresponding to an
estimation error of 3% or less has no visible effect on the reconstruction
result. This accuracy is reached in fewer than 5 iterations for the image shown
in figure 6.

The images shown in fig. 6 and the degradation model are provided by the
French Space Agency (CNES); they simulate the future SPOT 5 satellite images.
The algorithm presented herein has been successfully employed by Alcatel
Space Industries to reconstruct real satellite data provided by the French
defense agency. These images are confidential, so they cannot be shown here.
For the CNES 512 × 512 image, the computation time is 13 s for estimation
(made on a 64 × 64 area) and 12.6 s for reconstruction (Sun Ultra 1, 167 MHz).



Study of the algorithm: multiple solutions  For each initial $\delta$ we can
find a value of $\lambda$ for which $(\lambda, \delta)$ is optimal. Figure 4
shows sets of local minima that cancel the criterion derivatives. So for each
$\delta$ an optimal $\lambda$ can be found. For large $\delta$, the model is
nearly quadratic and reduces to a model with a single parameter
$\lambda/\delta$, so the graph is asymptotically linear.




Fig. 4. Sets of local minima: $(\lambda, \delta)$ hyperparameters which cancel
the log-likelihood derivatives (“hyper surfaces” $\varphi$-function); the
dashed line corresponds to $\varphi(u) =$

We then fix the $\delta$ parameter, so that we only estimate $\lambda$ by
computing the likelihood derivative $\nabla_\lambda$ w.r.t. $\lambda$. The
value of $\delta$ is chosen so as to penalize the gradients due to noise and to
preserve edges.

Good results are obtained with a fixed $\delta$ of about $7\sigma$. Changing
$\delta$ does not influence the global SNR, but locally changes the restoration
quality. High $\delta$ values improve edge reconstruction, while low values
enable more efficient noise filtering in homogeneous areas.

As we can see in figure 5, the SNR is optimal for the estimated value
$\hat{\lambda}$. The SNR is defined from the mean square difference between
the original and reconstructed images.
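
The text does not give the exact SNR formula. As a hedged illustration, the
sketch below assumes one common convention, the ratio in dB of the variance of
the original image to the mean square difference between the original and the
reconstruction; the function name `snr_db` and the flat pixel lists are
illustrative choices, not taken from the paper.

```python
import math


def snr_db(original, reconstructed):
    """SNR in dB: signal variance over mean squared error (assumed definition)."""
    n = len(original)
    mean = sum(original) / n
    signal_power = sum((x - mean) ** 2 for x in original) / n
    mse = sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / n
    return 10.0 * math.log10(signal_power / mse)


# Tiny example on flattened pixel lists.
original = [0.0, 10.0, 20.0, 30.0]
noisy = [1.0, 9.0, 21.0, 29.0]
snr_value = snr_db(original, noisy)
```

With this convention a smaller mean square difference gives a larger SNR, so a
curve such as the one in Fig. 5 peaks where the reconstruction is closest to
the original.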




Fig. 5. SNR with $\lambda$ varying around $\hat{\lambda}$ ($\delta = 10$)



Other possible algorithms  The Generalized Stochastic Gradient (proposed by
Younes) is another possible estimation method. Two Markov chains are sampled,
one from the prior law and one from the posterior law. Compared to our method,
the same criterion derivative is used, weighted by a factor to control
convergence; the solution is therefore the same. The essential difference is
that in the method proposed by Younes a single sample is needed for each
descent step and all samples are kept, without trying to reach the equilibrium
distribution.

A version of this algorithm has been used in previous work, where the
posterior samples were replaced by the image $\hat{X}$ restored with the
current hyperparameters; this avoids sampling the posterior law and is
equivalent to the Lakshmanan & Derin estimator mentioned earlier.

4 Conclusion 

The proposed algorithm is automatic and fast, and is able to reconstruct
images and estimate the parameters simultaneously.

As we do not actually estimate $\delta$ because of the multiplicity of the
$(\lambda, \delta)$ solutions, it seems necessary to introduce a priori
knowledge on the hyperparameters if we want the optimal couple to be unique. A
possible choice for the law $P(\lambda, \delta)$ could be linked to the
probability of obtaining $(\lambda, \delta)$ over a large set of images.

If we want to obtain qualitatively better reconstructions, we should take
higher order derivatives into account. Second order models seem to give better
results. However, this automatically introduces more parameters to be
estimated, so more gradient criteria must be evaluated. This does not change
the nature of the algorithm.

The chosen model is homogeneous: we assume that the same $(\lambda, \delta)$
couple is suitable for the entire (large) image. A better choice would be to
divide the image into small areas on which the hyperparameters are estimated.
Indeed, homogeneous areas would be better reconstructed if processed
separately, because the $\lambda$ value estimated on them is higher than the
one corresponding to edge areas. We are currently working on this problem.

Fig. 6. a) 256 × 256 original image extracted from “Nimes” © CNES, b) PSF (h),
11 × 11 pixels, c) observed image (blur and noise, $\sigma = 1.35$,
SNR = 16.6 dB), d) Wiener reconstruction (SNR = 20.2 dB), e) solution
reconstructed with the proposed algorithm, $\lambda = 0.5$ and $\delta = 10$
(SNR = 22.1 dB), f) error image




Auxiliary Variables for Markov Random Fields 
with Higher Order Interactions 



Robin D. Morris*

RIACS, NASA Ames Research Center, MS 269-2, Moffett Field, CA 94035 USA.
rdm@ptolemy.arc.nasa.gov
Tel: +1 650 604 0158, Fax: 1 650 604 3594

* This work was performed while the author held a National Research
Council-NASA Ames Research Associateship.



Abstract. Markov Random Fields are widely used in many image pro- 
cessing applications. Recently the shortcomings of some of the simpler 
forms of these models have become apparent, and models based on larger 
neighbourhoods have been developed. When single-site updating meth- 
ods are used with these models, a large number of iterations are required 
for convergence. The Swendsen-Wang algorithm and Partial Decoupling 
have been shown to give potentially enormous speed-up to computation 
with the simple Ising and Potts models. In this paper we show how the 
same ideas can be used with binary Markov Random Fields with essen- 
tially any support to construct auxiliary variable algorithms. However, 
because of the complexity and certain characteristics of the models, the 
computational gains are limited. 



1 Introduction 

Markov Random Fields (MRFs) were introduced into the image processing liter- 
ature in 1984 [3], and have since been widely used for many tasks, mainly in low 
level vision. Despite the increase in computational power that has become avail- 
able, and the inherent parallelisation that can be applied to the computational 
algorithms, computation with MRF models and single-site updating algorithms 
(the usual forms of the Gibbs sampler and Metropolis-Hastings [5] algorithms) 
is time consuming. In statistical physics applications, where the aim is to simu- 
late large interacting spin systems, the Swendsen-Wang (SW) algorithm [11] was 
developed to speed up the computation, especially at the critical point, when 
simulating the Ising [8] or Potts models. The Multi-Level Logistic model of [3] 
is just the Potts model, and so the SW algorithm is applicable in that case. 
However, it has become apparent that the Ising or Potts model does not capture 
the image characteristics that are important in segmentation tasks [10]. This has 
motivated the development of MRFs with longer range and more complex forms 
of interaction [1,12] to model more adequately the structures present in typical 
segmentation imagery. The application of these models has proved successful, 
but single-site updating algorithms have proved computationally intensive,
especially when there is a requirement to estimate the hyperparameters of these
models [4]. Motivated by the success of the SW algorithm in dramatically
speeding up computation with the Ising or Potts model, in this paper we
investigate the use of the auxiliary variable approach applied to MRFs with
long range interactions, both in the form used in the SW algorithm, and the
partial decoupling approach proposed in [6]. The SW algorithm has most benefit
near the critical point of the Ising model. The auxiliary variable methods
developed in this paper appear to be of limited benefit for the simulation of
models with long-range interactions. This is likely to be because the models
are being used well away from any critical regimes. Indeed the presence or
absence of critical behaviour in models with long-range interaction has to be
demonstrated on a model-by-model basis. Whether the long-range interaction
model considered in this paper has a phase transition is currently unknown.

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 131-142, 1999.



2 Auxiliary Variables for Markov Random Fields 



2.1 The Swendsen-Wang Algorithm and Partial Decoupling 

The idea behind auxiliary variable methods is the following: 

It is desired to simulate a distribution $\pi(x)$. Auxiliary variables $u$ are
introduced, with conditional distribution $\pi(u|x)$. This gives a joint
distribution $\pi(x, u) = \pi(u|x)\pi(x)$, with the desired marginal
distribution for $x$ of $\pi(x)$. Simulation of this distribution is generally
performed by alternately updating $u$ and $x$, the idea being to define
$\pi(u|x)$ such that the updates cause rapid mixing. The realisations of $x$
are those desired.

The Ising model is defined by

$$\pi(x) \propto \exp\Big(\sum_{i \sim j} \beta\, I[x_i = x_j]\Big) \tag{1}$$

where $i \sim j$ indicates nearest neighbour pairs.

For the SW algorithm the distribution $\pi(u|x)$ is defined such that the
$u_{ij}$ are independent, and is

$$p(u_{ij}|x) \propto \exp(-\beta I[x_i = x_j])\, I[0 \le u_{ij} \le \exp(\beta I[x_i = x_j])] \tag{2}$$

where $u_{ij}$ can be considered as a continuous ‘bond’ variable between the
pixels $x_i$ and $x_j$ and $I[\cdot]$ is the indicator function. This results
in

$$\pi(x|u) \propto \prod_{i \sim j} I[0 \le u_{ij} \le \exp(\beta I[x_i = x_j])] \tag{3}$$

What does this choice of distribution give us? Considering first
$p(u_{ij}|x)$, for $u_{ij} > 1$ we must have $\exp(\beta I[x_i = x_j]) > 1$,
or equivalently, $x_i = x_j$. Thus $u_{ij} > 1$ constrains $x_i$ and $x_j$ to
be in the same state. Conversely, if $x_i$ and $x_j$ are in the same state,
what is the probability of $u_{ij} > 1$? From the conditional distribution in
equation 2 we have that

$$p(u_{ij} > 1 \mid x_i = x_j) = 1 - \exp(-\beta) \tag{4}$$

Since it is only important whether $u_{ij}$ is greater or less than one we may
think of the $u_{ij}$ as binary bond variables. From equation 4 the bond
variable is present between two pixels in the same state with probability
$1 - \exp(-\beta)$. To sample $u$ thus involves placing bonds between
neighbouring pixels of the same state with probability $1 - \exp(-\beta)$ and
omitting bonds between neighbouring pixels of differing states.

Once the bonds are in place, the conditional distribution 7t(x|u) says that all 
configurations where bonded pixels are of the same state are equally probable. 
Thus to update x we form clusters of connected pixels and assign to all pixels 
of the cluster the same state, chosen uniformly from the allowed states. This 
scheme allows potentially large clusters of pixels to change state at each iteration, 
allowing the Markov chain to explore the distribution freely. 
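
The two steps just described (place bonds, then recolour clusters) can be
sketched in Python as follows. This is a minimal illustration for positive
$\beta$ on a 4-neighbour grid with free boundaries; the function name
`sw_update`, the list-of-lists image and the union-find bookkeeping are
assumptions of the example rather than details taken from the paper.

```python
import math
import random


def sw_update(x, beta, n_states, rng):
    """One Swendsen-Wang update of a 2-D grid `x` (list of lists of ints), beta > 0."""
    h, w = len(x), len(x[0])
    parent = list(range(h * w))  # union-find forest over pixels

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Bond step: join equal neighbours with probability 1 - exp(-beta).
    p_bond = 1.0 - math.exp(-beta)
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                r2, c2 = r + dr, c + dc
                if r2 < h and c2 < w and x[r][c] == x[r2][c2] and rng.random() < p_bond:
                    parent[find(r * w + c)] = find(r2 * w + c2)

    # Recolour step: one uniformly drawn state per cluster.
    colour = {}
    new_x = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            root = find(r * w + c)
            if root not in colour:
                colour[root] = rng.randrange(n_states)
            new_x[r][c] = colour[root]
    return new_x


rng = random.Random(0)
x = [[rng.randrange(2) for _ in range(8)] for _ in range(8)]
for _ in range(10):
    x = sw_update(x, beta=0.8, n_states=2, rng=rng)
```

Because every cluster is recoloured as a whole, a single update can change the
state of many pixels at once, which is the source of the speed-up over
single-site updating.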

In the discussion above we have assumed that the parameter $\beta$ in the
Potts model is positive, inducing clustering. However, the SW algorithm is
still applicable if $\beta$ is negative. In this case a similar argument gives
that neighbours in different states are constrained to remain in different
states with probability $1 - \exp(\beta)$. This forms ‘clusters’ where
neighbouring sites in the cluster must be in different states.

Partial decoupling was introduced [6,7] to overcome some problems with the 
basic form of the SW algorithm when simulating systems with data. In this case 
growing the clusters without taking any account of the data can be unhelpful, 
as the clusters are unlikely to reflect the structure in the data. A modification 
to the conditional distribution of the auxiliary variables allows the data to be 
taken into account when forming the clusters. Later in this paper, we will use this 
modification to systematically reduce the strength of the constraints introduced 
by the complexity of the higher order MRF models. 

The SW algorithm forms clusters which are coloured independently. The 
idea behind partial decoupling is to reduce the probability that bonds will be 
placed, with the consequence that the clusters will not be independent, and so 
the colouring of the clusters themselves will have to be updated using a Markov 
chain Monte Carlo (MCMC) scheme [6]. 

Instead of the definition of $p(u_{ij}|x)$ in equation 2 above, we now define

$$p(u_{ij}|x) \propto \exp(-\delta\beta I[x_i = x_j])\, I[0 \le u_{ij} \le \exp(\delta\beta I[x_i = x_j])] \tag{5}$$

where $\delta$ is a fixed constant between zero and one. This results in

$$\pi(x|u) \propto \exp\Big(\sum_{i \sim j} (1 - \delta)\beta I[x_i = x_j]\Big) \times \prod_{i \sim j} I[0 \le u_{ij} \le \exp(\delta\beta I[x_i = x_j])] \tag{6}$$

So now we form clusters (or enforce the dissimilarity of neighbours, in the
case of negative $\beta$) by bonding pixels with probability
$1 - \exp(-\delta\beta)$. However, the clusters thus formed are not
independent, and their colouring must be updated conditionally on the pixels
neighbouring the cluster. Thus cluster $C$ takes colour $k$ conditional on its
neighbours, $\mathcal{N}(C)$, with probability

$$\pi(k \mid \mathcal{N}(C)) \propto \exp\Big(\sum_{i \sim j:\ i \in C,\ j \in \mathcal{N}(C)} (1 - \delta)\beta\, I[k = x_j]\Big) \tag{7}$$

and is updated using, for example, the Gibbs sampler or the Metropolis-Hastings
algorithm.
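
As a hedged illustration of the conditional colouring step, the Python sketch
below computes, for one cluster, colour probabilities proportional to
$\exp((1 - \delta)\beta\, n_k)$, where $n_k$ counts the neighbouring pairs
whose outside pixel has colour $k$. The grid representation, the `labels` map
of cluster identifiers and the function name are assumptions of the example.

```python
import math


def cluster_colour_probs(x, labels, cluster, beta, delta, n_states):
    """Conditional colour probabilities of one cluster given its neighbours."""
    h, w = len(x), len(x[0])
    counts = [0] * n_states  # counts[k]: boundary pairs whose outside pixel has colour k
    for r in range(h):
        for c in range(w):
            if labels[r][c] != cluster:
                continue
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                r2, c2 = r + dr, c + dc
                if 0 <= r2 < h and 0 <= c2 < w and labels[r2][c2] != cluster:
                    counts[x[r2][c2]] += 1
    weights = [math.exp((1.0 - delta) * beta * n) for n in counts]
    total = sum(weights)
    return [wgt / total for wgt in weights]


# A one-row image with three single-pixel clusters; recolour the middle one.
image = [[0, 1, 0]]
labels = [[0, 1, 2]]
probs = cluster_colour_probs(image, labels, cluster=1, beta=1.0, delta=0.5, n_states=2)
```

A Gibbs update would draw the new colour of the cluster from these
probabilities. Setting `delta` to 1 makes them uniform, which recovers the
independent recolouring of the SW algorithm.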

In [7] the $\delta$’s were chosen to split the lattice up into regular blocks.
In [6] the data was used to set the values of $\delta$ to encourage clusters
supported by the data. In this paper we will use the $\delta$’s to avoid
overconstraining the possible updates, and to reduce the complexity introduced
by negative $\beta$’s.



2.2 The ‘chien’ model 

In its original formulation the ‘chien’ model [1] was defined as a binary MRF,
where the potential function considered a 5 × 5 neighbourhood, and 3 × 3
cliques. For a clique of size 3 × 3 there are 512 configurations. When
symmetries are removed, this reduces down to 51 classes. These classes are
shown in figure 1. By considering the energy associated with lines and edges,
the 51 parameters, $c_1$ to $c_{51}$, in figure 1 were reduced down to
functions of three parameters, representing boundary length ($e$), line length
($l$) and noise ($n$). The reader is referred to [1] for full details of this
model.
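
The text does not list which symmetries are removed. A short Python check,
assuming they are the four rotations, the mirror images and the exchange of
the two colours, enumerates all 512 binary 3 × 3 configurations and counts the
resulting equivalence classes; the helper names are illustrative.

```python
from itertools import product


def rotate(g):
    """Rotate a 3x3 grid (tuple of 3 row tuples) by 90 degrees clockwise."""
    return tuple(tuple(g[2 - c][r] for c in range(3)) for r in range(3))


def mirror(g):
    """Reflect a grid left to right."""
    return tuple(row[::-1] for row in g)


def invert(g):
    """Exchange the two colours."""
    return tuple(tuple(1 - v for v in row) for row in g)


def canonical(g):
    """Smallest representative among the 16 symmetric variants of g."""
    variants = []
    for base in (g, invert(g)):
        for start in (base, mirror(base)):
            cur = start
            for _ in range(4):
                variants.append(cur)
                cur = rotate(cur)
    return min(variants)


classes = {
    canonical(tuple(tuple(bits[3 * r:3 * r + 3]) for r in range(3)))
    for bits in product((0, 1), repeat=9)
}
n_classes = len(classes)
```

Under these assumed symmetries `n_classes` is 51, which agrees with the number
of classes quoted above.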
Fig. 1. The 51 classes of clique considered in the ‘chien’ model

2.3 Auxiliary Variables for the ‘chien’ Model

Conventionally, the probability of a configuration of an MRF is written in the 
following form 




That is, the total energy of the configuration is made up of a sum over all 
the cliques of the potentials associated with each clique configuration. In the 
Ising/Potts model these cliques are the neighbouring pairs, and the configura- 
tions the state of homogeneity of these pairs. In the ‘chien’ model the cliques 
are 3 × 3 blocks, and the configurations are those shown in figure 1. For the
Ising/Potts model, writing the energy in the form

$$p(x) \propto \exp\Big(\sum_{i \sim j} \beta\, I[x_i = x_j]\Big) \tag{9}$$

indicates how to introduce auxiliary variables to induce clusters with either
reduced or eliminated dependency between the clusters. To introduce similar
auxiliary variables for the ‘chien’ model, we must write the pdf for the
‘chien’ model in a similar form.

The pdf for the ‘chien’ model can be written as

$$p(x) \propto \exp\Big(\sum_j v(a_j, b_j, c_j, d_j, e_j, f_j, g_j, h_j, i_j)\Big) \tag{10}$$

where the pixels are labeled as shown in figure 2, the sum over $j$ is over all
sites in the image and $v(\cdot)$ is the value given by the classification in
figure 1. In subsequent equations we will drop the $j$ subscript to simplify
the notation.



a  b  c
d  e  f
g  h  i

Fig. 2. Labeling of the pixels in a clique of the ‘chien’ model

Fig. 3. Sample from the ‘chien’ model, e = 0.8, l = 1.5, n = 1.6



The energy of a clique can be written, using this labeling, as

$$
\begin{aligned}
v(a,b,c,d,e,f,g,h,i) ={}& I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,I[f=e]\,I[g=e]\,I[h=e]\,I[i=e]\; c_1 \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,I[f=e]\,I[g=e]\,I[h=e]\,(1 - I[i=e])\; c_3 \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,I[f=e]\,I[g=e]\,(1 - I[h=e])\,I[i=e]\; c_4 \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,I[f=e]\,I[g=e]\,(1 - I[h=e])(1 - I[i=e])\; c_5 \\
&+ \cdots \\
&+ (1 - I[a=e])(1 - I[b=e])\,I[c=e]\,(1 - I[d=e])(1 - I[f=e])\,I[g=e]\,I[h=e]\,I[i=e]\; c_{35} \\
&+ \cdots \\
&+ (1 - I[a=e])(1 - I[b=e])(1 - I[c=e])(1 - I[d=e])(1 - I[f=e])(1 - I[g=e])(1 - I[h=e])(1 - I[i=e])\; c_2
\end{aligned}
\tag{11}
$$

where the $c$’s are the clique parameters from figure 1. A computer algebra
package can be used to expand this expression, and then to reduce it into a
simple form, where each entry is a product of terms of the form
$I[\cdot = \cdot]$, for example $I[a = e](c_7 - c_2)$ and

$$I[a=e]\,I[b=e]\,I[c=e]\,I[g=e]\,I[h=e]\,(-c_2 + 3c_7 + 2c_{10} - 3c_{15} + c_{18} - 2c_{20} - 3c_{23} - c_{26} - c_{28} + c_{30} - c_{32} - c_{34} + 2c_{37} + 2c_{38} - c_{40} - c_{42} - c_{44} + 3c_{45} + c_{46} + c_{47})$$

etc. In fact, terms involving all pairs, triples, 4-tuples, 5-tuples, 6-tuples,
7-tuples, 8-tuples and the single 9-tuple are present in this representation
(some with a coefficient of zero). This enables us to write the energy in the
form

$$
\begin{aligned}
v(a,b,c,d,e,f,g,h,i) ={}& I[a=e]\,\beta_{ae} + I[b=e]\,\beta_{be} + \cdots \\
&+ I[a=e]\,I[b=e]\,\beta_{abe} + I[a=e]\,I[c=e]\,\beta_{ace} + \cdots \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,\beta_{abce} + \cdots \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,\beta_{abcde} + \cdots \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,I[f=e]\,\beta_{abcdef} + \cdots \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,I[f=e]\,I[g=e]\,\beta_{abcdefg} + \cdots \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,I[f=e]\,I[g=e]\,I[h=e]\,\beta_{abcdefgh} + \cdots \\
&+ I[a=e]\,I[b=e]\,I[c=e]\,I[d=e]\,I[f=e]\,I[g=e]\,I[h=e]\,I[i=e]\,\beta_{abcdefghi}
\end{aligned}
\tag{12}
$$
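
The paper obtains this expansion with a computer algebra package. As an
alternative illustration, the Python sketch below computes coefficients of the
same kind by inclusion-exclusion (Möbius inversion) over the subsets of
neighbours that match the centre pixel. This is a swapped-in technique, not
the author's procedure; `toy_potential` is a hypothetical energy used only to
exercise the code, and the expansion here also carries a constant term for the
empty subset.

```python
from itertools import combinations

NEIGHBOURS = "abcdfghi"  # the eight pixels around the centre e


def subsets(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def expand(v):
    """Coefficients beta[T] such that v(S) is the sum of beta[T] over all T inside S."""
    return {
        t: sum((-1) ** (len(t) - len(u)) * v(u) for u in subsets(t))
        for t in subsets(NEIGHBOURS)
    }


def toy_potential(s):
    """Hypothetical clique energy: depends only on how many neighbours match e."""
    return float(len(s) ** 2)


beta = expand(toy_potential)
```

For a single pixel the coefficient is the energy with only that pixel matching
minus the energy with no pixel matching, which has the same shape as the
example $I[a = e](c_7 - c_2)$ quoted in the text.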



In this representation, the distribution can be written as

$$\pi(x) \propto \prod_j \exp\Big(\sum_k I_k[a,b,c,d,e,f,g,h,i]\,\beta_k\Big) \tag{13}$$

where $I_k[\cdot]\beta_k$ are the terms in equation 12, $k$ being the index of
the term. This is now in a form which makes application of the extension to
the SW algorithm in [2] clear.

We introduce a set of independent auxiliary variables, $u_k$, corresponding to
each of the terms in the representation of equation 12, with conditional
distribution

$$\pi(u_k(j)|x) = \exp(-I_k[\cdot]\beta_k)\, I[0 \le u_k(j) \le \exp(I_k[\cdot]\beta_k)] \tag{14}$$

That is, for each term $I_k[\cdot]\beta_k$ we have a set of auxiliary
variables, $u_k$, as in the standard SW algorithm. Each element of $u_k$,
$u_k(j)$, is independent and drawn from a uniform distribution
$U[0, \exp(I_k[a_j, b_j, \ldots]\beta_k)]$, resulting in a conditional
distribution for $x$ of

$$\pi(x|u) \propto \prod_j \prod_k I[u_k(j) \le \exp(I_k[\cdot]\beta_k)] \tag{15}$$

So we now have that $x|u$ is uniformly distributed, provided that the
constraints introduced by the particular realisation of the $u$ variables are
satisfied. These constraints now form ‘bonds’ between groups of up to 9 pixels
at a time, constraining the groups to be in the same state, for
$\beta_k > 0$, or groups of up to 9 pixels are constrained to not all be in
the same state, for $\beta_k < 0$. An analogous procedure can be performed for
any binary MRF.

We can divide the terms into two classes, those where the coefficient
$\beta_k$ is greater than zero, and those where it is less than zero. The
update procedure is then as follows

1. For each term in the expansion of equation 11 with $\beta_k > 0$, for each
site $j$, if all the pixels concerned are in the same state, constrain them to
be in the same state in the next iteration with probability
$1 - \exp(-\beta_k)$.

2. For each term in the expansion with $\beta_k < 0$, for each site $j$, if all
the pixels concerned are not in the same state, constrain them to not all be in
the same state in the next iteration with probability $1 - \exp(\beta_k)$.

3. Choose any random colouring which satisfies the constraints from steps 1
and 2.

The constraints generated in step 1 (type 1 constraints) are easily dealt
with: there are only two ways that a group of $n$ pixels can all be
homogeneous in a binary MRF. Thus we can use all of the constraints with
$\beta_k > 0$, irrespective of how many pixels are included in that term, to
form clusters of pixels in a similar manner to the SW algorithm, where all the
pixels in a cluster must be in the same state after the update.

The constraints from step 2 (type 2 constraints) are more problematic: there
are $2^n - 2$ ways a group of $n$ pixels can not all be in the same state in a
binary MRF. The update strategy used was that known as ‘generate-and-test’
[9]. This heuristic method works well when either the density of the
constraints is low (such that there are many configurations that satisfy all
the constraints and finding one is relatively easy), or when the density of
constraints is high, when there are very few solutions, but local changes that
satisfy the constraints will almost always move towards one of the few global
solutions.

The ‘generate-and-test’ approach results in the following algorithm.

1. Start from a random configuration that satisfies the type 1 constraints

2. Go through the list of type 2 constraints until one is not satisfied

3. To satisfy this constraint

(a) flip the state of one of the clusters involved in this constraint

(b) if the constraint is still not satisfied, un-flip this cluster, and flip
another until the constraint is satisfied

4. Go to step 2 until all the constraints are satisfied
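
The four steps above can be sketched in Python as follows. The sketch works
directly on cluster indices, so the type 1 constraints hold by construction;
each type 2 constraint is given as a tuple of cluster indices that must not all
share one colour. The function names, the random order in which clusters are
tried and the `max_sweeps` safety cap are assumptions added for the example.

```python
import random


def satisfied(constraint, colour):
    """A type 2 constraint holds when its clusters are not all the same colour."""
    return len({colour[c] for c in constraint}) > 1


def generate_and_test(n_clusters, constraints, rng, max_sweeps=1000):
    """Search for a binary colouring of clusters that meets every type 2 constraint."""
    colour = [rng.randrange(2) for _ in range(n_clusters)]  # step 1
    for _ in range(max_sweeps):
        # Step 2: first constraint that is not satisfied, if any.
        violated = next((c for c in constraints if not satisfied(c, colour)), None)
        if violated is None:
            return colour  # step 4: every constraint holds
        # Step 3: try the clusters of this constraint in random order.
        for cluster in rng.sample(list(violated), len(violated)):
            colour[cluster] ^= 1  # (a) flip one cluster
            if satisfied(violated, colour):
                break
            colour[cluster] ^= 1  # (b) un-flip and try another
    raise RuntimeError("no satisfying colouring found within max_sweeps")


constraints = [(0, 1, 2), (2, 3), (3, 4)]
colouring = generate_and_test(5, constraints, random.Random(0))
```

Fixing one constraint can break another that shares a cluster, which is why
the search loops back to step 2 rather than making a single pass.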

Because the constraints are generated from the current colouring of the
pixels, we know that there is at least one colouring which satisfies all the
constraints in stage 2. (However, this colouring is not of interest to us; the
whole point of constructing the auxiliary variable algorithm is to find a
recolouring that is significantly different from the current colouring.)
Irreducibility is, however, guaranteed, as there is a non-zero probability of
no constraints being placed, resulting in the x variables being independent,
and so any state can be reached in one update.

Figure 3 shows a sample, generated by the single-site Gibbs sampler, from the
‘chien’ model, with the parameters being $e = 0.8$, $l = 1.5$, $n = 1.6$,
after 10,000 iterations. These parameters were chosen to give a sample which
shows regions together with fine structure. Starting from a random initial
image (each pixel is black or white with probability 0.5), figure 4 shows the
clusters induced by the type 1 constraints for this set of parameter values:
the bond-graph shows how the pixels are constrained, and the cluster map shows
how the bonds divide the image up into groups (there are actually 109 regions
in this image). Clearly with this density of type 1 constraints finding a
colouring which satisfies the type 2 constraints will be easy. It will,
however, be very similar to the initial colouring. This motivates the use of
the partial decoupling approach: the $\delta$’s can be chosen to reduce the
density of type 1 constraints.

If the initial state is all of one colour the situation is worse - almost every 
pixel will be bonded into one region, and the sampler will be essentially immobile. 



2.4 Partial Decoupling for the ‘chien’ Model 

The density of the type 1 constraints led to an almost immobile algorithm in
the previous subsection. Here we consider how partial decoupling may help the
mobility of the sampler by reducing the density of the type 1 constraints and
by eliminating the type 2 constraints. This also results in an easier update
algorithm.

To derive the partial decoupling algorithm, the same representation of the
energy as in equation 12 is used. However, the conditional distribution of the
$u_k$’s is now

$$\pi(u_k(j)|x) \propto \exp(-\delta_k I_k[\cdot]\beta_k)\, I[0 \le u_k(j) \le \exp(\delta_k I_k[\cdot]\beta_k)] \tag{16}$$

where $\delta_k$ is the factor associated with each of the auxiliary
variables. This enables the influence in the clustering and anti-clustering of
each term in equation 12 to be controlled. In practice this enables the
complexity of the update algorithm caused by the terms with $\beta_k < 0$ to
be eliminated, by choosing $\delta_k = 0$ for these terms. In the experiments
described below, the same value of $\delta$ was used for all the terms with
$\beta_k > 0$.

Fig. 4. Bond graph (left) and region map (right) from the type 1 constraints
for the SW type algorithm (see text)

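With this set of auxiliary variables, the conditional distribution of $x|u$ is
now

$$\pi(x|u) \propto \prod_j \Big[ \prod_{k:\, \beta_k < 0} \exp(I_k[\cdot]\beta_k) \times \prod_{k:\, \beta_k > 0} \exp(I_k[\cdot](1 - \delta_k)\beta_k) \times \prod_{k:\, \beta_k > 0} I[0 \le u_k(j) \le \exp(\delta_k I_k[\cdot]\beta_k)] \Big] \tag{17}$$

The final term of this equation is the cluster constraints: when updating the
$u_k$ with $\beta_k > 0$, the pixels are bonded with probability
$1 - \exp(-\delta_k\beta_k)$, and this term says that all the pixels in a
cluster must be the same colour.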

The first two terms give the distribution of the colours of the clusters. From
them we can easily derive the conditional distribution for the colour of each
cluster, given its neighbours. For computational purposes, it is convenient to
transform the representation back into the form of equation 10, except that
now the potentials of the configurations have been modified by the inclusion
of the $(1 - \delta_k)$ terms. This allows simpler computation when computing
the probabilities of the allowed colours for the regions when implementing the
Gibbs sampler.

This results in a form of block update algorithm. The weakened cluster con- 
straints form regions, the size and shape of which is a function of the model. 
These regions are then recoloured with probabilities which reflect the cluster 
formation process and the model. 

Figure 5 shows the bond-graph and the corresponding cluster map for the
application of the partial decoupling algorithm to an initial random image.
Clearly the density of the bonds is much reduced from figure 4, so the
algorithm should be more mobile. The colouring of the clusters is updated
conditionally on the cluster’s neighbours, using the Gibbs sampler. Figure 6
shows the initial image and the result after one iteration. The algorithm
clearly moves rapidly towards the equilibrium distribution. However, further
updates using the Partial Decoupling algorithm rapidly move towards an almost
uniform image: even with reduced bonding strength, the number of possible ways
the pixels in a uniform region can be bonded results in most of them being
joined into one cluster, and then the conditional update will preferentially
re-colour the regions to eliminate edges. The algorithm is thus of limited
applicability: it moves rapidly towards the equilibrium distribution, but
moves slowly once it has converged.






Fig. 6. Initial random image (left) and image after one iteration of the
Partial Decoupling algorithm (right)

3 Conclusions

We have shown how to construct an auxiliary variable algorithm for the MRF 
known as the ‘chien’ model, and explained how this method may be used with an 
essentially arbitrary binary MRF. Because of the strength of interaction in the 
model, however, introducing a full set of auxiliary variables results in a sampler 
which moves very slowly. Reducing the set of auxiliary variables, and reducing 
the influence of those included, enables an algorithm to be constructed which is 
a form of block-update algorithm. This moves rapidly from a random start point 
towards the equilibrium distribution, but then moves into one of the modes of 
the distribution, and becomes immobile. 

The SW algorithm for the Ising model shows most spectacular improvement 
at the critical point, when the correlation length becomes infinite. The ‘chien’ 
model with parameters corresponding to the characteristics of real images does 
not seem to exhibit this behaviour, and so auxiliary variables are of less benefit. 
Whether the ‘chien’ and other higher order models do show critical behaviour 
for some parameter values is an open problem. If they do, then the algorithm 
described in this paper should be very useful for simulating the models in those 
behaviour regimes.

Abstract. This paper is concerned with hierarchical Markov Random
Field (MRF) models and their application to multispectral image
segmentation. We present an extension of the classic Gaussian model for
the data likelihood, based on a Generalized Gaussian (GG) model, which
requires a “shape parameter”. In order to obtain an unsupervised
multispectral image segmentation, we develop a two-step algorithm. In the
first step, we estimate the parameters associated with a causal Markovian
model (on a quad-tree) and a generalized Gaussian model for the
data-driven term, by using an Iterative Conditional Estimation (ICE)
algorithm [16]. One original contribution of this paper is to explicitly
decorrelate the multispectral observations during the estimation step on
the quad-tree structure. The second step gives the segmentation map
obtained with the estimated parameters, according to the Modes of
Posterior Marginals (MPM) estimator. The main motivation of the paper is
to extend the variety of noise models which result from the distribution
mixture on multispectral images. Some results on synthetic and SPOT
images validate our approach.

Keywords: correlated sensors, generalized Gaussian, multispectral
segmentation, quad-tree, parameter estimation.



1 Introduction 

One of the main benefits of MRF modeling coupled with a Bayesian formulation
is that it establishes an explicit link between the observation field and the
label field, together with the introduction of contextual and a priori
information [9]. Nevertheless, non-causal MRF models are known to yield
iterative and computationally intensive segmentation algorithms [12]. Besides,
the problem of unsupervised Markovian segmentation is complex. The main
difficulty is that the estimation of parameters is required for the
segmentation, while one or several segmentations are usually required for
parameter estimation [13]. In this paper, we work

^ Acknowledgments: The authors thank P. Perez and P. Bouthemy for fruitful com- 
ments and suggestions. 
on a hierarchical MRF attached to the nodes of a quad-tree. The specific
structure of these models results in an attractive causality property through
scale [9], which allows the design of exact non-iterative inference algorithms
[12], similar to those employed in the framework of Markov chain models [8].

We consider a couple of random fields $W = (X, Y)$, with $Y = \{Y_s, s \in S\}$
the field of observations located on a lattice $S$ of $N$ sites $s$, and
$X = \{X_s, s \in S\}$ the label field. Each of the $Y_s$ takes its value in
$\Lambda_{obs} = \{0, \ldots, 255\}^C$ where $C$ stands for the number of
spectral bands, i.e., $Y_s = (Y_s^{(1)}, \ldots, Y_s^{(C)})^t$ is a vector of
size $C$. Moreover, each $X_s$ takes its value in
$\Lambda_{label} = \{\omega_1, \ldots, \omega_K\}$ where $K$ is the number of
classes. The distribution of $(X, Y)$ is defined, firstly, by $P_X(x)$, the
distribution of $X$, supposed to be stationary and Markovian in scale, and
secondly, by the site-wise conditional data likelihoods
$P_{Y_s|X_s}(y_s|x_s)$, which depend on the concerned class label $x_s$. If the
data are assumed to be independent conditionally on the labeling process $X$,
one gets

$$P_{Y|X}(y|x) = \prod_{s \in S} P_{Y_s|X}(y_s|x) = \prod_{s \in S} P_{Y_s|X_s}(y_s|x_s) \qquad (1)$$

In real life, labeled samples are usually not available and we have to
estimate the parameters from unlabeled samples, i.e., the label field $X$ is
hidden. In statistics, this problem is well known as the incomplete data
problem. The observations $Y$ correspond to the incomplete data whereas $W$
constitutes the complete data. The prior distribution $P_X(x)$ depends on some
parameter vector $\Phi_x$ while the data likelihood $P_{Y|X}(y|x)$ depends on
another parameter vector $\Phi_y$. Both of them have to be estimated. The joint
and posterior distributions

$$P_{Y,X}(y, x) = P_X(x)\, P_{Y|X}(y|x), \qquad P_{X|Y}(x|y) \propto P_X(x)\, P_{Y|X}(y|x)$$

depend on $\Phi = (\Phi_x, \Phi_y)$. This dependence will be made explicit when
necessary, i.e., by denoting the posterior distribution as
$P_{X|Y}(x|y, \Phi)$.

If we note $f_i(y_s) = P_{Y_s|X_s}(y_s | x_s = \omega_i)$, one considers
generally [7,5,6] that the general expressions of $f_i$ are known and depend
on a parameter set $\Phi_y$. The case most often studied is the Gaussian
mixture $\mathcal{N}(\mu_i, \sigma_i^2)$, which is totally described by
$\Phi_y = \{(\pi_i, \mu_i, \sigma_i^2)\ \forall i \in [1, K]\}$ where $\pi_i$
represents the proportion of each class in the mixture. But in the general
case, $f_i(y_s)$ is not exactly and accurately known and one needs to introduce
a more general shape for the site-wise likelihood.

The Generalized Gaussian model introduced here for the data-driven term
has proved successful in capturing the variety of the noise laws present in the
distribution mixture of synthetic multispectral images. For example, as shown
in [14], in the specific case of sonar images, the generalization of the
conditional likelihood (from the Rayleigh law to the Weibull law) improves the
agreement between observations and model thanks to the introduction of a
“shape parameter”. The number of parameters is indeed increased, but the shape
parameter allows a better fit between observations and model.
Unfortunately, in reality, most multispectral observations are correlated [10]
and in this case, the expression of the data-driven term is not always known.
In the Gaussian case, sensors can be independent or not, because the
mathematical expression of the Gaussian likelihood $P_{Y_s|X_s}(y_s|x_s)$ is
well known either way. Moreover, as mentioned above, the conditional
probabilities are not necessarily Gaussian in practice. In this paper, we
propose to generalize the method presented in [8] to the case of possibly
non-Gaussian correlated sensors (conditioned on the random process $X$) in a
quad-tree structure. To do that, we proceed as follows [17]:

- each sensor is decorrelated from the others by using a Cholesky transform,
  noted $A_i$ for the $i$-th class,

- then, the components of $Z_s = A_i Y_s$ are assumed to be independent
  conditionally to $X_s = \omega_i$:

$$P_{Z_s|X_s}(z_s|x_s) = P(z_s^{(1)}, \ldots, z_s^{(C)} | x_s) = \prod_{c=1}^{C} P(z_s^{(c)} | x_s). \qquad (2)$$

The approach we choose to solve the unsupervised MRF-based segmentation
problem is a two-step process. First, a parameter estimation step is conducted
to infer both the noise model parameters and the hierarchical MRF model
parameters. In this paper, the parameter estimation method is based on the ICE
algorithm [16]. Note that, with a similar approach, the Stochastic Expectation
Maximization (SEM) algorithm [4], [3] aims at determining the Maximum
Likelihood estimates of the parameters by making use of simulations of the
missing data. This requires the use of the ML estimators of the GG model given
in [15]. Nevertheless, the ICE approach is more general, because the choice of
the estimator of $\Phi_y$ remains free. Then, a second step (segmentation step)
is devoted to the segmentation itself, using the values of the estimated
parameters. At the end, the final label field corresponding to the
segmentation is computed by using the Modes of Posterior Marginals (MPM)
estimator [12].

This paper is organized as follows. In Section 2, we detail the Generalized 
Gaussian modelization and the algorithm for sensor decorrelation. Section 3 
presents estimation and classification steps on the proposed hierarchical Marko- 
vian model in the multispectral case. Experimental results obtained on synthetic 
scenes are reported in Section 4. Then, we conclude with some perspectives. 

2 Noise Model 

2.1 Generalized Gaussian Noise Model 

The noise model considered here is referred to as Generalized Gaussian (GG) 
noise and is obtained by generalizing the Gaussian density [11] to represent 
different degrees of exponential decay 

$$p_Z(z) = [2\Gamma(1/p)]^{-1}\, \eta(p)\, p\, \exp\{-[\eta(p)\, |z - \mu|]^p\} \qquad (3)$$
where $\Gamma(\cdot)$ is the gamma function, $p$ is a positive parameter
governing the rate of decay ($p = 1$ for Laplacian noise, $p = 2$ for Gaussian
noise, $p > 8$ for nearly uniform noise), $\mu$ is the mean, and

$$\eta(p) \triangleq \left[\frac{\Gamma(3/p)}{\sigma^2\, \Gamma(1/p)}\right]^{1/2} \qquad (4)$$



where $\sigma^2$ is the variance. The shape of the GG probability density
function (pdf) (3) is illustrated in Fig. 1 for some values of the parameter
$p$. We note that for small values of $p$ (i.e., $p < 2$), the probability
density functions have heavier tails than the Gaussian density, which produces
impulsive random samples. This noise model is well suited to some physical data
properties. For example, according to Algazi and Lerner [2], densities
representative of certain atmospheric impulse noises can be obtained by picking
$0.1 < p < 0.6$.
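
As a hedged illustration of Eqs. (3) and (4), the density can be evaluated numerically as sketched below. The function names `gg_eta` and `gg_pdf` and their argument names are illustrative choices, not taken from the paper; the log-gamma function is used only to avoid overflow when $p$ is small.

```python
import math


def gg_eta(p, sigma2=1.0):
    """Scale factor eta(p) of Eq. (4): sqrt(Gamma(3/p) / (sigma2 * Gamma(1/p)))."""
    return math.sqrt(math.exp(math.lgamma(3.0 / p) - math.lgamma(1.0 / p)) / sigma2)


def gg_pdf(z, mu=0.0, sigma2=1.0, p=2.0):
    """Generalized Gaussian density of Eq. (3) with mean mu, variance sigma2, shape p."""
    eta = gg_eta(p, sigma2)
    # Normalising constant p * eta / (2 * Gamma(1/p)), computed through log-gamma.
    norm = 0.5 * p * eta * math.exp(-math.lgamma(1.0 / p))
    return norm * math.exp(-((eta * abs(z - mu)) ** p))
```

With unit variance, `p=2.0` reproduces the standard Gaussian density and `p=1.0` the Laplacian density, which gives a quick sanity check of the normalisation.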




Fig. 1. Generalized Gaussian probability density function with $\sigma^2 = 1$,
$\mu = 0$ and for several values of $p$.



Statistical properties of the maximum likelihood estimators of the General- 
ized Gaussian density function are considered in [15]. The maximum likelihood 
estimators of the GG parameters (i.e., mean, variance and shape parameter) can

^ The symbol $\triangleq$ stands for “equals by definition”.
be used to derive the SEM algorithm from one realization of the random field
$X$ drawn according to the a posteriori probability. However, in the following,
estimators based on the empirical moments are derived and implemented because
they lead to simpler (i.e., less time-consuming) estimators. The ICE algorithm
is then performed (see subsection 3.3).



2.2 Moments Based Estimators of the GG Parameters 

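Let us consider $N$ random samples $Z_i$, $i = 1, \ldots, N$, assumed to be
independent and identically distributed (iid) from the pdf given in (3). Three
parameters are required to characterize the GG pdf from the samples $Z_i$.
First, the empirical mean $\hat{\mu}$ is computed:

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} Z_i \qquad (5)$$

Secondly, the estimator of $\sigma^2$, based on the second-order moment, is
simply the empirical variance

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (Z_i - \hat{\mu})^2 \qquad (6)$$

and finally, the estimation of the shape parameter $p$ is based on the
estimation of the centered fourth-order moment $\mu_{(4)}$ of the GG pdf, given
by the general relation

$$\mu_{(n)} = E[(Z - E[Z])^n] = \frac{\Gamma\left(\frac{n+1}{p}\right)}{\eta(p)^n\, \Gamma(1/p)} \quad \text{for } n \text{ even}. \qquad (7)$$

By substituting (4) in (7) for $n = 4$ and using estimators of the mean
$\hat{\mu}$, variance $\hat{\sigma}^2$ and centered fourth-order moment
$\hat{\mu}_{(4)}$, we obtain the following relation

$$h(\hat{p}) \triangleq \frac{\Gamma(1/\hat{p})\, \Gamma(5/\hat{p})}{\Gamma^2(3/\hat{p})} = \frac{\hat{\mu}_{(4)}}{\hat{\sigma}^4} \qquad (8)$$

where

$$\hat{\mu}_{(4)} = \frac{1}{N} \sum_{i=1}^{N} (Z_i - \hat{\mu})^4. \qquad (9)$$

Eq. (8) can be solved numerically in order to obtain the estimator $\hat{p}$ of
the “shape” parameter $p$.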



^ Eq. (8) gives a unique estimate because the function $h(p)$ is monotonically
decreasing for $p > 0$.
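
The moment-based estimation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the ratio of gamma functions is matched to the empirical ratio of the fourth central moment to the squared variance, the root is found by plain bisection (the paper only says the equation is solved numerically), and the names `kurtosis_ratio`, `estimate_gg_parameters` and the search bracket `[p_lo, p_hi]` are hypothetical.

```python
import math
import random


def kurtosis_ratio(p):
    """Gamma(1/p) * Gamma(5/p) / Gamma(3/p)**2, computed via log-gamma."""
    return math.exp(math.lgamma(1.0 / p) + math.lgamma(5.0 / p)
                    - 2.0 * math.lgamma(3.0 / p))


def estimate_gg_parameters(samples, p_lo=0.1, p_hi=50.0, tol=1e-8):
    """Return (mean, variance, shape) from the empirical moments of `samples`."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((z - mean) ** 2 for z in samples) / n
    mu4 = sum((z - mean) ** 4 for z in samples) / n
    target = mu4 / var ** 2
    # The ratio decreases with p, so targets outside the bracket are clamped.
    if target >= kurtosis_ratio(p_lo):
        return mean, var, p_lo
    if target <= kurtosis_ratio(p_hi):
        return mean, var, p_hi
    lo, hi = p_lo, p_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kurtosis_ratio(mid) > target:
            lo = mid  # ratio still too large: the root lies at a larger p
        else:
            hi = mid
    return mean, var, 0.5 * (lo + hi)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    print(estimate_gg_parameters(data))  # the shape estimate should be near 2
```

Bisection is used here because the monotonicity of the ratio guarantees a single root inside the bracket; any other one-dimensional root finder would serve equally well.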
2.3 Correlated Non-Gaussian Mixture Model

In order to validate the use of GG marginal distributions for image
segmentation, the algorithm is first applied to a synthetic image and then to a
SPOT image. In the following, the generation of synthetic images is presented.
The resulting correlated non-Gaussian mixture model will be considered for the
image segmentation algorithm.

The model used for the correlated non-Gaussian data observed by the sensors
is presented in Fig. 2. An independent zero-mean random vector
$Z_s = [Z_s^{(1)}, Z_s^{(2)}, \ldots, Z_s^{(C)}]^t$ conditionally to $X_s$,
with any marginal distributions
$P_{Z_s^{(c)}|X_s}(z_s^{(c)} | x_s = \omega_i) = g_i^c(z_s^{(c)})$ (for
$c = 1, \ldots, C$), and unit diagonal covariance matrix, is passed through a
spatial linear filter $L_i$ of length $C$ to produce a correlated non-Gaussian
vector conditionally to $X_s$, with a dependence structure generated by $L_i$.
The mean of each sensor $m_i^{(c)}$ for $c = 1 \ldots C$ is finally added to
obtain the correlated non-Gaussian vector
$Y_s = [Y_s^{(1)}, \ldots, Y_s^{(C)}]^t$ conditionally to $X_s$.




Fig. 2. Correlation model of an array of $C$ sensors. $L_i$ is a $C \times C$
mixture triangular matrix for the $i$-th class.



Fig. 3 presents simulated images for marginals modeled as Generalized
Gaussian pdfs with the correlation structure presented in Fig. 2. The channel
$Y^{(c)}$ for $c = 1$, 2 and 3 is plotted by rescaling the dynamic range to 256
grey levels.

The ground truth image is the same as that of Pieczynski et al. [17]. The
noise parameters are chosen such that:

- means and standard deviations are supposed to be the same for each channel
  (i.e., $m_i^{(1)} = m_i^{(2)} = m_i^{(3)} = m_i$ and
  $\sigma_i^{(1)} = \sigma_i^{(2)} = \sigma_i^{(3)} = \sigma_i$),

- correlation coefficients are supposed to be the same for cross channels
  (i.e., $\rho_i^{(1,2)} = \rho_i^{(1,3)} = \rho_i^{(2,3)} = \rho_i$),

- the “shape” parameters are also supposed to be the same for each channel
  (i.e., $p_i^{(1)} = p_i^{(2)} = p_i^{(3)} = p_i$),

- values of the above parameters for the two classes ($i = 1$ and 2) are set
  to: $p_1 = 0.5$, $p_2 = 4$, $m_2 - m_1 = 2\sigma_1 = 2\sigma_2$ and
  $\rho_1 = \rho_2 = 0.8$.
Modeling the noise process as in Fig. 2 suggests the reversed model: the
dependence structure of this model is straightforward, but solving for the
marginals is more difficult. In order to circumvent this problem, Pieczynski et
al. [17] suppose that the marginals necessarily belong to a set of probability
density functions. In this paper, we assume the independence of the GG
distributions $g_i^c(z_s^{(c)})$, which are representative of a large variety
of pdfs governed by the rate-of-decay parameter.




channel $Y^{(1)}$, channel $Y^{(2)}$, channel $Y^{(3)}$

Fig. 3. Non-Gaussian correlated observed image $Y$ according to the noise model
presented in Fig. 2 with GG marginals $g_i^c(z_s^{(c)})$.



The observed random vector $Y_s$, conditionally to $X_s = \omega_i$, is
described by its positive-definite symmetric covariance matrix $\Sigma_i$
($i = 1, \ldots, K$), which admits the Crout (Cholesky) factorization

$$\Sigma_i = L_i L_i^t \qquad (10)$$

where $L_i$ is a unique $C \times C$ lower triangular matrix. Then, the
correlated vector $Y_s$ can be transformed into an uncorrelated vector

$$Z_s = A_i Y_s \quad \text{with } A_i = L_i^{-1}. \qquad (11)$$

Directly from (11), we have $Y_s = A_i^{-1} Z_s$ and the pdfs of $Z_s$ and
$Y_s$ are related via

$$f_i(y_s) = |A_i^{-1}|^{-1} \prod_{c=1}^{C} g_i^c\left([A_i y_s]^{(c)}\right) \qquad (12)$$

where $|\cdot|$ denotes the determinant.
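
The decorrelation step can be illustrated with a small pure-Python sketch. It assumes the class covariance matrix is given as a list of rows and is positive definite; the helper names `cholesky_lower`, `decorrelate` and `log_abs_det_A` are hypothetical. Rather than forming the inverse of the triangular factor explicitly, the sketch solves the triangular system by forward substitution, which yields the same decorrelated vector.

```python
import math


def cholesky_lower(cov):
    """Lower triangular L with L * L^t = cov (cov symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if acc <= 0.0:
                    raise ValueError("covariance matrix is not positive definite")
                L[i][j] = math.sqrt(acc)
            else:
                L[i][j] = acc / L[j][j]
    return L


def decorrelate(y, L):
    """Return z = L^{-1} y by forward substitution (solves L z = y)."""
    z = []
    for i in range(len(y)):
        acc = y[i] - sum(L[i][k] * z[k] for k in range(i))
        z.append(acc / L[i][i])
    return z


def log_abs_det_A(L):
    """log |det A| with A = L^{-1}: minus the sum of the logs of the diagonal of L."""
    return -sum(math.log(L[i][i]) for i in range(len(L)))
```

The log-determinant is the Jacobian term needed when the density of the correlated vector is computed from the product of the decorrelated marginals.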

The following section presents the estimation and segmentation steps based 
on a hierarchical Markovian approach. 




Unsupervised Multispectral Image Segmentation 149 



3 Hierarchical Markovian Approach 

3.1 Introduction 

In order to perform an unsupervised multispectral image segmentation, we could
use Markov chains as proposed in [8]. Nevertheless, this approach needs a
transformation of the image into a one-dimensional set [18], which loses part
of the richness of the 2D representation. In this paper, we prefer a more
general framework for image segmentation using a hierarchical Markovian model,
based on the quad-tree [12]. Furthermore, as explained in Section 2, we will
consider the Generalized Gaussian family as the likelihood model.

3.2 Properties on the Quad-tree 

We consider two sets of random variables: the labels $X = (X_s)_{s \in S}$ and
the observed data $Y = (Y_s)_{s \in S}$. Both variables are indexed by $S$, a
set of pixels on a quad-tree (cf. Fig. 4a).
$S = \bigcup_{n=0}^{R} S^n = \{S^0, \ldots, S^R\}$ where $R$ is the number of
levels and $S^R = r$ is the root of the tree [12]. In our application, $Y$ is
defined only for $n = 0$. $s^-$ is the unique parent of a pixel $s$ and its four
children will be designated by $s^+ = \{t : s = t^-\}$ (cf. Fig. 4b). In the
following, $X^n$ and $Y^n$ stand for $(X_s)_{s \in S^n}$ and
$(Y_s)_{s \in S^n}$ respectively. Under the Markovian assumption of causality
in scale, i.e., $P(X^n | X^k, k > n) = P(X^n | X^{n+1})$, we assume that the
transition probabilities can be factorized as:

$$P(X^n | X^{n+1}) = \prod_{s \in S^n} P(X_s | X_{s^-})$$

and that the observation model has for expression:

$$P(Y | X) = \prod_{s \in S^0} P(Y_s | X_s)$$

As a consequence, we deduce that $(X, Y)$ is Markovian on the quad-tree [12]
with

$$P(X, Y) = P(X_r) \prod_{s \in S^0} P(Y_s | X_s) \prod_{s > r} P(X_s | X_{s^-})$$
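
The parent and children relations on the quad-tree are easy to express with integer coordinates. The sketch below is only one possible indexing, assumed for illustration: a site is a triple `(level, row, col)`, level 0 is the finest resolution, and the root sits at level `root_level`. The function names are hypothetical.

```python
def level_side(level, root_level):
    """Side length (in sites) of the square grid at a given level."""
    return 2 ** (root_level - level)


def parent(level, row, col, root_level):
    """Unique parent of a site, or None for the root."""
    if level == root_level:
        return None
    return (level + 1, row // 2, col // 2)


def children(level, row, col):
    """The four children of a site (empty list at the finest level)."""
    if level == 0:
        return []
    return [(level - 1, 2 * row + dr, 2 * col + dc)
            for dr in (0, 1) for dc in (0, 1)]


def sites_at_level(level, root_level):
    """Iterate over all sites of one level in raster order."""
    side = level_side(level, root_level)
    for row in range(side):
        for col in range(side):
            yield (level, row, col)
```

With this convention a tree with `root_level = 3` covers an 8 x 8 image at level 0, and every non-root site has exactly one parent, which is what the factorization over parent-child pairs relies on.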



3.3 The Segmentation Algorithm 

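We note the prior root probabilities $\pi_i$ and the prior parent-child
transition probabilities $a_{ij}$:

$$\pi_i = P_{X_r}(X_r = \omega_i) \qquad (13)$$

$$a_{ij} = P(X_s = \omega_j | X_{s^-} = \omega_i) \qquad (14)$$

In the following, we will use the a posteriori probabilities:

$$\xi_s(i) = P_{X_s|Y}[X_s = \omega_i | y] \qquad (15)$$

$$\Psi_s(i, j) = P_{X_s, X_{s^-}|Y}[X_s = \omega_j, X_{s^-} = \omega_i | y] \qquad (16)$$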
(16) 
Fig. 4. a) The quad-tree structure b) Notations on a dyadic tree



These probabilities are estimated with the “Forward-Backward” algorithm [12,10],
which allows the non-iterative and exact estimation of the marginal a
posteriori distribution. The different parts of the algorithm are the following:



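A) Initialization Step. Before starting the algorithm, we need to initialize
some parameters:

$$\Phi^{[0]} = \{\pi_i^{[0]},\ a_{ij}^{[0]},\ \Phi_{z^{(c)}}^{[0]}\} \quad \text{with } \forall (i,j) \in [1,K] \text{ and } c \in [1,C] \qquad (17)$$

$$A_i^{[0]} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (18)$$

$$Z^{[0]} = Y \qquad (19)$$

(a) Data model parameters $\Phi_{z^{(c)}}^{[0]}$ (with the shape parameter set
to 2), for $c \in [1,C]$, are obtained by a Fuzzy K-means clustering technique
[1] generalized to multispectral images.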

(b) Prior model parameters are initialized by tt^ = aij.^. = 2 (k^i) 

Ciii = 



$Z$ is the image in the decorrelated space and $\Phi_{z^{(c)}}$ for
$c = 1, \ldots, C$ are the conditional parameters in this space. In the
following, conditional laws in $Y$,

$$f_i(y_s) = P_{Y_s|X_s}[y_s | x_s = \omega_i]$$

will be deduced from Generalized Gaussian conditional laws in $Z$,

$$g_i^c(z_s^{(c)}) = P[z_s^{(c)} | \omega_i] = GG(\mu_i^c, \sigma_i^c, p_i^c) \qquad (20)$$

by

$$f_i^{[q]}(y_s) = \prod_{c \in \{R,G,B\}} g_i^{c,[q]}(z_s^{(c)}) \times |(A_i^{[q]})^{-1}|^{-1} \qquad (21)$$

where $A_i$ is defined in Eqs. (10) and (11), and $[q]$ denotes the $q$-th
iteration.



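B) Iterative Conditional Estimation Step. For the $q$-th iteration:

(a) The bottom-up pass on the quad-tree allows the estimation of
$P(x_s | y_{\geq s})$ and $P(x_s | x_{s^-}, y_{\geq s})$. In particular, for
the root we have $s = r$ and $P(x_r | y_{\geq r}) = P(x_r | y)$.

(b) The top-down pass provides one realization $x^{[q]}$ of $X$ by random
drawing according to $P_{X_s|X_{s^-},Y}(x_s | x_{s^-}, y_{\geq s})$.

(c) Updating of the prior parameters:

$$a_{ij}^{[q+1]} = \frac{\sum_{s > r} \delta(x_s^{[q]}, \omega_j)\, \delta(x_{s^-}^{[q]}, \omega_i)}{\sum_{s > r} \delta(x_{s^-}^{[q]}, \omega_i)}, \qquad \pi_i^{[q+1]} = \delta(x_r^{[q]}, \omega_i)$$

where $\delta(x_s, \omega_i) = 1$ if $x_s = \omega_i$ and 0 elsewhere.

(d) Computation of $\Phi_{z^{(c)}}^{[q+1]}$ according to the empirical moments.

(e) $A_i^{[q+1]}$ is calculated with the Cholesky decomposition such that

$$\Sigma_i^{[q+1]} = L_i^{[q+1]} (L_i^{[q+1]})^t, \qquad A_i^{[q+1]} = (L_i^{[q+1]})^{-1} \qquad (22)$$

(f) Determination of $Z^{[q+1]}$:

$$z_s^{[q+1]} = A_i^{[q+1]} y_s \quad \forall s \in S^0 \text{ and } x_s^{[q]} = \omega_i \qquad (23)$$

(g) Calculation of $g_i^{c,[q+1]}$ for the evaluation of Eq. (21).

(h) $q \leftarrow q + 1$. If $q < q_{max}$, we repeat the iterative estimation
step, using $\Phi^{[q]}$.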

C) Classification Step. The segmentation is performed according to the MPM
criterion:

$$\hat{x}_s = \arg\max_{\omega_i \in \Lambda_{label}} P(X_s = \omega_i | y) = \arg\max_{\omega_i \in \Lambda_{label}} \xi_s(i) \qquad (24)$$
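
Two of the simpler steps of the algorithm, the update of the prior parameters from one sampled label realization and the final MPM decision, can be sketched as follows. This is an illustration under assumptions rather than the authors' code: sites are numbered by integers, `parents[s]` holds the index of the parent of site `s` (or `None` for the root), the transition update is taken to be the empirical frequency of parent-child label pairs, and a parent class that never occurs in the sample is given a uniform row, which is a choice made here and not stated in the paper.

```python
def update_prior_parameters(labels, parents, num_classes):
    """Empirical root and transition probabilities from one sampled labeling.

    labels[s]  : class index (0 .. num_classes-1) of site s in the realization
    parents[s] : index of the parent of site s, or None for the root
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    root_probs = [0.0] * num_classes
    for s, par in enumerate(parents):
        if par is None:
            root_probs[labels[s]] = 1.0           # indicator of the root label
        else:
            counts[labels[par]][labels[s]] += 1   # parent class i -> child class j
    transitions = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: unseen parent class, fall back to a uniform row.
            transitions.append([1.0 / num_classes] * num_classes)
        else:
            transitions.append([c / total for c in row])
    return root_probs, transitions


def mpm_decision(posterior_marginals):
    """For each site, pick the class index with the largest posterior marginal."""
    return [max(range(len(m)), key=m.__getitem__) for m in posterior_marginals]
```

For instance, `mpm_decision([[0.2, 0.8], [0.6, 0.4]])` labels the first site with class index 1 and the second with class index 0. The posterior marginals themselves would come from the bottom-up and top-down passes, which are not reproduced in this sketch.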



4 Image Segmentation Results

In this section, we present some results obtained on synthetic multispectral im- 
ages presented in Fig. 3 and one SPOT image. 
(a) Gaussian correlated noise model (b) non-Gaussian correlated noise model

Fig. 5. Segmented image on a quad-tree structure with Gaussian and
non-Gaussian correlated noise



4.1 Synthetic Multispectral Image 

The segmentation based on the GG model recovers the real picture best
(Fig. 5b). The estimated parameters are close to the true parameters used to
generate the multispectral correlated pictures. The obtained estimates of the
“shape” parameters are $\hat{p}_1 = 0.55$ and $\hat{p}_2 = 3.8$, against
$p_1 = 0.5$ and $p_2 = 4$ for the true values. The percentage of mislabeled
pixels is 30% in the Gaussian case and 20% in the GG case.

Other simulation results allow us to assert that : 

- if the parameter $p$ is far from 2 (particularly for small values, i.e.,
  $p < 2$), then the GG model is well adapted in comparison with the Gaussian
  one,

- the interest of the GG model appears for strongly noised images.



4.2 SPOT Image 



Fig. 6 presents the three channels (infrared (R), green (G), blue 
(B)) of a SPOT image which is represented with RGB colors in Fig. 7. The 
resulting segmentation using the GG noise model is presented in RGB colors on 
Fig. 8. 

Six classes are considered. Table 1 presents the estimated “shape” parameters
for each class. It is interesting to note that this parameter is different from
2 on these real data, so that the GG model fits the data better than the
Gaussian model. However, by using the classic multispectral Gaussian model, a
segmentation very close to Fig. 8 is obtained. This is probably because the
image is not strongly noisy.
channel red channel green channel blue

Fig. 6. Observed SPOT image Y for each channel. 



class $\omega_i$     1      2      3      4      5      6
channel R         1.28   5.64   1.01   3.42   1.68   2.26
channel G         0.71   1.81   0.97   4.31   2.30   2.63
channel B         0.48   2.88   1.73   2.35   1.99   1.19

Table 1. Estimation of the shape parameter




Fig. 7. Observed multispectral SPOT image 
Fig. 8. Segmented image on a quad-tree structure with non-Gaussian correlated
noise model (legend: class 1 to class 6).

5 Conclusion 

This paper presents a hierarchical MRF model to segment multispectral
correlated images. The interest of our approach lies in the modeling of the
data likelihood, which is based on a generalization of the Gaussian pdf. Using
a shape parameter, this model captures the variety of the noise laws present in
the distribution mixture. As our goal is to make the segmentation unsupervised,
the number of parameters to estimate increases, even if the shape parameter
allows a better fit between observations and model. This is why, for
computational cost considerations, we chose an Iterative Conditional Estimation
based on the empirical moments, estimated on one random sample of $X$ at each
iteration. The parameter estimation step is thus computed according to an ICE
approach on a quad-tree and explicitly takes into account the correlation
between sensors. Then, the segmentation step is based on the MPM criterion.

This approach has been validated on synthetic correlated multispectral
pictures, both for the estimation step and for the segmentation step. More
precisely, we have shown that the Generalized Gaussian model improves the
performance of the segmentation algorithm in the presence of impulsive noise.
A Generalized Gaussian Generating Procedure

In this appendix we present a generating procedure for the independent
identically distributed (iid) random variables $Z_i$ ($i = 1, \ldots, N$)
modeled according to the generalized Gaussian distribution.

The procedure starts by considering a random variable $X$ distributed
according to the gamma pdf:

$$p_X(x) = \frac{1}{\Gamma(1/p)}\, x^{1/p - 1}\, e^{-x}\, u(x)$$

where $u(\cdot)$ is the unit step function and $p$ is the positive parameter
governing the rate of decay of the GG distribution to be generated. The
generation of a gamma random variable is easily performed by exploiting its
reproductive property (see, for example, the algorithm G-2 of [19]).

The second step of the procedure considers the nonlinear transformation

$$Y = \frac{X^{1/p}}{\eta(p)}$$

where $\eta(p)$ is defined in (4). It follows immediately that the pdf of $Y$
is given by

$$p_Y(y) = \frac{p\, \eta(p)}{\Gamma(1/p)}\, \exp\{-[\eta(p)\, y]^p\}\, u(y)$$

Finally, a random variable with GG pdf is easily generated by multiplying $Y$
by a random variable taking only the equiprobable values $\pm 1$.
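
A minimal Python sketch of this generating procedure is given below. It assumes a gamma variable with shape $1/p$ and unit scale, followed by the power transformation and a random sign; the standard library call `random.gammavariate` stands in for the gamma generator cited in the text, and the names `gg_eta` and `sample_gg` as well as the added mean `mu` are illustrative.

```python
import math
import random


def gg_eta(p, sigma2=1.0):
    """Scale factor eta(p) = sqrt(Gamma(3/p) / (sigma2 * Gamma(1/p)))."""
    return math.sqrt(math.exp(math.lgamma(3.0 / p) - math.lgamma(1.0 / p)) / sigma2)


def sample_gg(p, mu=0.0, sigma2=1.0, rng=random):
    """Draw one generalized Gaussian sample with shape p, mean mu, variance sigma2."""
    x = rng.gammavariate(1.0 / p, 1.0)          # gamma variable, shape 1/p, scale 1
    y = x ** (1.0 / p) / gg_eta(p, sigma2)      # one-sided GG magnitude
    sign = 1.0 if rng.random() < 0.5 else -1.0  # equiprobable +1 / -1
    return mu + sign * y


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_gg(0.5, rng=rng) for _ in range(5)]
    print(draws)  # heavy-tailed samples for p = 0.5
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient when generating synthetic test images.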
Abstract. We propose a vector representation approach to contour
estimation from noisy data. Images are modeled as random fields composed
of a set of homogeneous regions; contours (boundaries of homogeneous
regions) are assumed to be vectors of a subspace of $L^2(T)$ generated
by a given finite basis; B-spline, Sinc-type, and Fourier bases are
considered. The main contribution of the paper is a smoothing criterion,
interpretable as an a priori contour probability, based on the Kullback
distance between neighboring densities. The maximum a posteriori
probability (MAP) estimation criterion is adopted. To solve the resulting
optimization problem (joint estimation of contours, subspace dimension,
and model parameters), we propose a gradient projection type algorithm.
A set of experiments performed on simulated and real images illustrates
the potential of the proposed methodology.



1 Introduction 

Boundary estimation/detection plays a key role in image analysis/understanding, 
pattern recognition, computer vision, computer graphics, and computer-aided 
animation. Although the approaches to contour estimation are numerous, most 
of them share the same spirit: contours are obtained through the maximization 
of objective functions composed of a prior term, that favors contours with some
attributes (e.g., continuity, smoothness, elasticity, and rigidity), and a data term, 
that measures the adjustment to data. 

As in many other fields, different aspects of contour estimation have been ad- 
dressed either under the energy-minimization framework or under the Bayesian 
framework. 

1.1 Energy-Minimization Framework 

Under the energy viewpoint, data and prior terms are interpretable as external
energy, which attracts the contour to the desired features, and internal energy
(e.g., due to contour tension and rigidity), respectively. This perspective was
introduced in the original work of Kass [16], where the concept of snake (or active
* This work was supported by the Portuguese PRAXIS XXI program, under project 
2/2.1.TIT/1580/95. 

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 157-173, 1999.
contour, or deformable model) was put forward: “A snake is an energy-minimizing
spline guided by internal constraint forces and influenced by image forces that
pull it towards features such as lines and edges”.

Since its introduction, the initial concept of active contour has been modified
and improved in order to adapt it to different image classes and to overcome
some of its drawbacks; namely, snake attraction by artifacts, snake degeneration,
convergence and stability of the deformation process, myopia (i.e., use of image
data only along the contour neighborhood), initialization, and model parameter
estimation. References [3], [5], [11], [28] are illustrative examples of approaches
to solve common problems with different snake techniques.

1.2 Bayesian Framework 

Under the Bayesian viewpoint, the objective function referred to above and its
data and prior terms are interpretable as the posterior contour probability, the
likelihood function associated with the observation mechanism, and the contour
prior probability, respectively; since the sought contour maximizes the posterior
probability, it is interpretable as the maximum a posteriori (MAP) estimate.

In many imaging problems (e.g., medical imaging, synthetic aperture radar, 
synthetic aperture sonar) the likelihood function can be derived from the knowl- 
edge of the generation mechanism [7], [10], rather than from other heuristic and 
common sense arguments. A statistical framework is therefore, in these cases, 
the correct choice. 

Relevant advantages of the Bayesian approach are the following: 

(a) it allows the inclusion of prior knowledge about the parameters to be
estimated in a model-based fashion;

(b) it supplies an adequate framework for dealing with nuisance parameters (e.g., 
noise power, parameters distributions, blur coefficients). 

1.3 Prior and Contour Representation 

Contour representation and the prior term, say $p_c$, are closely related
issues that have received great attention, regardless of the viewpoint. In
snake-type approaches the term $p_c$ is, typically, of the form

(1)

where $R(c)$ measures the smoothness of the contour $c$. Usually $R$ is a
combination of norms of different derivatives [16]. In the Bayesian approach,
Markov random fields have been used as a way of modelling contour smoothness
[7,10,13].

The prior contour information can be imposed by appropriate selection of the
function $p_c$ and/or by introducing constraints on the set of admissible
contours. For example, the functional $R$ in (1) can be tailored in order to
have contours with continuous derivatives. Another possibility is to find the
solution in a constrained space;
one can assume, for example, that contours belong to a parametrized family;
i.e., $c(t) = c(t; \alpha)$, where $\alpha$ is defined in a given set $\Theta$.
This is the case of the deformable parametrized models/templates (e.g., Fourier
[11], spline [2], and wavelet descriptors [4]).

1.4 Proposed Approach 

Herein we address contour estimation under the Bayesian setting. We assume
that images are piecewise homogeneous random fields, and that contours are
the boundaries of open connected sets.

Likelihood Function 

The likelihood function is derived from the image generation mechanism. We 
assume that pixels within each homogeneous region are independent samples of 
a selected random variable. For example, coherent amplitude images (e.g., ultra- 
sound and synthetic aperture radar and sonar images) are Rayleigh distributed 
[26], X-ray images are very well approximated with a Gaussian distribution [19], 
and nuclear and confocal microscopic images are Poisson distributed [22]. 

We take as hypothesis that the random variables associated with the image 
pixels are independent, i.e., we assume the so-called conditional independence 
property [14]. In an image system, this is a correct assumption if the resolution 
volumes contributing to different pixels are disjoint. This is approximately the 
case in most acquisition systems, since there is no information gain in acquiring 
extremely correlated neighboring data. 

Prior 

We assume that contours belong to a finite-dimensional subspace spanned by a
given vector basis. Smoothness properties of contours are closely related to
those of the basis vectors and to the subspace dimension $K$ [6].

Roughly, the basis dimension determines the frequency content of contours.
What should then be a suitable subspace dimension $K$? From the projection
error point of view, $K$ should be as large as possible. However, as $K$
increases the subspace becomes less constrained and the estimated contours more
noisy.

We tackle the estimation of the subspace dimension by assuming that contours
$c(t; \alpha, K)$ are random, with probability density function of the form

$$p_c(c(t; \alpha, k)) = p_K(k). \qquad (2)$$

The density $p_K$ is chosen to be a decreasing function of $K$, thus favoring
smooth contours. The exact structure of $p_K$ is derived based on the goodness
of the estimate.

The maximum a posteriori probability (MAP) estimation criterion is adopted.
To solve the resulting optimization problem, a gradient projection type
algorithm is used.

Dealing with contours as subspace elements is very appealing, namely due to 
the following: 
(a) it is a parametrized approach: given a subspace basis, a natural
parametrization is the set of basis coefficients, which are normally much
smaller than the basis dimension;

(b) given a generic contour $c \in S$, the closest contour to $c$ in a subspace
of $S$ is given by the projection of $c$ onto this subspace.

Fourier descriptors [11], B-splines [2], and wavelets [4] have already been
proposed in the field of contour estimation. However, only work [11] explores the
vector space perspective; namely, it introduces a minimum description length
(MDL) [20] type principle for the determination of the subspace dimension.

The MDL criterion, as applied in [11], is a smoothing criterion depending
only on the subspace dimension $K$. The smoothing criterion herein proposed,
besides depending on $K$, depends also on the Kullback distance between
neighboring densities. This modification plays a key role in ensuring that the
estimated space dimension is, to a great extent, independent of the parameters
of the neighboring densities.

The main contributions of this work are the following:

(a) the study of the adequacy of subspaces generated by B-spline, Sinc-type,
and Fourier bases for modeling contour smoothness;

(b) the introduction of a criterion for the subspace dimension estimation based
on the estimate robustness;

(c) the proposal of a complete adaptive scheme that iteratively estimates the
contour, the distribution parameters, and the subspace dimension.

The paper is organized as follows. Section 2 addresses aspects of contour
representation using B-spline, Sinc-type, and Fourier bases. Section 3 proposes
two algorithms for contour estimation: the first assumes that the subspace
dimension is known; the second estimates the subspace dimension jointly with
the contours. Finally, Section 4 presents results obtained with real data.

2 Contour Representations and Subspaces 

Contours are closed periodic curves $c(t) = \{x(t), y(t)\}$, such that
$c(t) = c(t + T)$. For notational convenience, assume that contours are defined
in the complex plane $\mathbb{C}$, and, therefore, $x(t)$ and $y(t)$ are the
real and imaginary parts of $c$, respectively.

We assume that $c \in L^2(T)$ (contour power over its period is finite). Since
$L^2(T)$ is a separable Hilbert space [17], there exist bases
$\{\varphi_n(t)\}$, for $n = 0, 1, \ldots$, in $L^2(T)$ such that each vector
$c \in L^2(T)$ is given by the linear combination

$$c(t) = \sum_{n=0}^{\infty} a_n \varphi_n(t). \qquad (3)$$

For an orthogonal basis, the coefficients $a_n$ are unique and given by

$$a_n = \langle c, \varphi_n \rangle = \frac{1}{T} \int_0^T c(t)\, \varphi_n^*(t)\, dt. \qquad (4)$$
Equality (3) is to be understood as a limit in norm.

In the proposed approach, contours are chosen to be elements of the subspace
$S_K = \mathrm{span}\{\varphi_0, \ldots, \varphi_{K-1}\}$, generated by the
vector basis $\varphi_n$, for $n = 0, 1, \ldots, K-1$. Each contour is then
written as

$$c(t; \alpha) = \sum_{n=0}^{K-1} a_n \varphi_n(t) \qquad (5)$$

where $\alpha = \{a_0, \ldots, a_{K-1}\}$.

The contour smoothness constraint is enforced by adequate selection of the
basis $\{\varphi_n\}$ and of the subspace dimension $K$. In this work we
consider B-spline [6], Sinc-type, and Fourier bases.
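
To make the subspace representation concrete, the sketch below projects a closed contour, given as a complex-valued function of $t$, onto a small Fourier basis and resynthesizes it from the coefficients. Several choices are assumptions made for illustration only: the basis functions are complex exponentials indexed by signed integer frequencies rather than by $n = 0, \ldots, K-1$, the inner product is approximated by a Riemann sum over `num_samples` points, and the function names are hypothetical.

```python
import cmath
import math


def fourier_basis(n, t, period=1.0):
    """Complex exponential basis function of integer frequency n."""
    return cmath.exp(2j * math.pi * n * t / period)


def project(contour, freqs, period=1.0, num_samples=256):
    """Approximate a_n = (1/T) * integral of c(t) * conj(phi_n(t)) dt for each n."""
    coeffs = {}
    for n in freqs:
        acc = 0j
        for k in range(num_samples):
            t = period * k / num_samples
            acc += contour(t) * fourier_basis(n, t, period).conjugate()
        coeffs[n] = acc / num_samples
    return coeffs


def synthesize(coeffs, t, period=1.0):
    """Evaluate the contour as the linear combination of the basis functions."""
    return sum(a * fourier_basis(n, t, period) for n, a in coeffs.items())


def ellipse(t):
    """Example contour: an ellipse with semi-axes 2 and 1 in the complex plane."""
    return 2.0 * math.cos(2.0 * math.pi * t) + 1j * math.sin(2.0 * math.pi * t)
```

Projecting `ellipse` on the frequencies `(-1, 0, 1)` recovers it exactly up to rounding, since an ellipse only contains the frequencies +1 and -1; a contour with sharper features would need a larger subspace dimension to be represented with small error.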

2.1 B-splines Basis 

Spline functions are piecewise polynomials [6], which have been widely used
to represent contours and surfaces in computer graphics, computer vision, and
signal and image processing [8], [9], [25], [24]. Given the set of so-called
knots $\{t_0 < t_1 < \ldots < t_k\} \subset \mathbb{R}$, an $m$-order spline is
a piecewise polynomial function defined on $[t_0, t_k]$, which is continuous on
$[t_m, t_{k-m}]$. Given a knot sequence, the set of all splines which are
continuous on $[t_m, t_{k-m}]$ is a linear space of dimension $(k - m)$. The
family of so-called B-spline functions, generated by the Cox-deBoor recursion
[6], is a basis for this linear space. For equispaced knots, the B-splines are
named uniform, and given by

= (6) 

where index i denotes the i-th basis element, $T_s = t_{i+1} - t_i$, and

= (7) 

' V ' 

m convolutions 

with 

^0 [ 1 ti t 

^ 1 0 otherwise. 

Since we are interested in representing periodic curves, splines and their
B-spline basis must be modified accordingly. For this purpose, define
$\{\bar t_n, n \in \mathbb{Z}\}$, with $\bar t_n = t_{n \bmod k}$, the periodic
extension of the knot sequence $\{t_0 < t_1 < \ldots < t_k\}$ [12]. The basis
functions $\bar B_i^m(t)$ are now periodic extensions of $B_i^m(t)$ with period
$T = t_k - t_0$, given by

$$\bar B_i^m(t) = \sum_{n=-\infty}^{\infty} B_{i+nk}^m(t). \tag{8}$$

When using the spline representation, we assume that contours are elements
of the subspace $S_K = \mathrm{span}(\bar B_0^m, \ldots, \bar B_{K-1}^m)$ and are,
therefore, continuous; the degree of smoothness is enforced by the subspace
dimension K: as the subspace dimension increases, contours become less constrained.

B-splines exhibit local control: when representing curves as linear B-spline
combinations, modifying a coefficient causes only a small part of the curve to
change. This leads to simple and effective algorithms for computing displacements
of the active contour under the influence of image forces.

In all examples presented herein we use m = 3. The spline contours are
therefore continuous. This is a common choice in vision and computer graphics
[9]. Nevertheless, the concepts to be presented apply to any m-order spline.



2.2 Sinc-type Basis

A natural way to impose smoothness is to constrain the curves to be
F-bandlimited (i.e., having maximum frequency F). The set of F-bandlimited curves
of finite energy is a linear space; the sequence $\{S_n(t)\} = \{S_0(t - nT_s)\}$,
where $T_s = 1/(2F)$ and

$$S_0(t) = \mathrm{sinc}(2Ft) = \frac{\sin 2\pi F t}{2\pi F t}, \tag{9}$$

is an orthonormal basis of this space. The projection of a curve $c(t)$ on
$S_n(t)$ is exactly $c(nT_s)$ (see, e.g., [27]).

In order to adapt basis elements $S_n(t)$ to periodic curves, one should have
$|S_0(t)| \simeq 0$ for $|t| > T/2$. Since $S_0(t)$ goes to zero as $t^{-1}$ for
$|t| \to \infty$, this might not be fulfilled if $K = T/T_s$ is too small. To
overcome this difficulty, we replace the basis function (9) with

$$S_0(t) = \frac{\sin 2\pi F t}{2\pi F t}\, \frac{\cos(2\pi F t \rho)}{1 - (4\rho F t)^2}. \tag{10}$$

Basis (10) is the impulse response of a raised cosine filter with a roll-off factor
$\rho$ [15], which goes to zero as $t^{-3}$ for $|t| \to \infty$. Setting, for
example, $\rho = 0.4$, we can take, for most practical purposes, $S_0(t) \simeq 0$
for $|t| > 5T_s$.

The basis $\{S_n(t)\} = \{S_0(t - nT_s/(1 + \rho))\}$, with $S_0(t)$ given by (10),
generates the space of $F(1 + \rho)$-bandlimited functions. Therefore the smoothness
of space elements is enforced by selecting F. Since the sampling interval
$T_s/(1 + \rho)$ must be equal to $T/K$ (i.e., an integer number of bases over T),
the relation between F and K is

$$F = \frac{1}{2(1 + \rho)}\, \frac{K}{T}. \tag{11}$$
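The raised cosine basis function (10) and relation (11) can be sketched numerically as follows. This is only an illustrative sketch: the function names, the NumPy dependency, and the values K = 30 and T = 128 are assumptions made here, not part of the original text. The removable singularity of (10) at $4\rho F|t| = 1$ is replaced by its limit, $(\pi/4)\,\mathrm{sinc}(1/(2\rho))$, obtained with L'Hopital's rule.

```python
import numpy as np

def bandwidth_from_dimension(K, T, rho):
    """Relation (11): F = K / (2 (1 + rho) T)."""
    return K / (2.0 * (1.0 + rho) * T)

def raised_cosine(t, F, rho):
    """Basis function (10) evaluated at times t (scalar or array)."""
    t = np.asarray(t, dtype=float)
    x = 4.0 * rho * F * t
    denom = 1.0 - x ** 2
    singular = np.isclose(denom, 0.0)
    # Avoid dividing by zero; the singular samples are overwritten below.
    safe_denom = np.where(singular, 1.0, denom)
    # np.sinc(u) is sin(pi u) / (pi u), so np.sinc(2 F t) matches (9).
    value = np.sinc(2.0 * F * t) * np.cos(2.0 * np.pi * F * rho * t) / safe_denom
    # Limit of (10) where 4 rho F |t| = 1 (hypothetical helper value).
    limit = (np.pi / 4.0) * np.sinc(1.0 / (2.0 * rho)) if rho > 0 else 0.0
    return np.where(singular, limit, value)

# Illustrative values only (assumed, not taken from the paper).
K, T, rho = 30, 128.0, 0.4
F = bandwidth_from_dimension(K, T, rho)
Ts = 1.0 / (2.0 * F)
samples = raised_cosine(np.arange(0, 6) * Ts, F, rho)
```

With these definitions the basis function equals one at the origin and vanishes at the nonzero multiples of $T_s$, as the plain sinc of (9) does.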



The periodic extension $\bar S_i(t)$ of $S_i(t)$, with period T, is given by

$$\bar S_i(t) = \sum_{n=-\infty}^{\infty} S_i(t + nT). \tag{12}$$



When using the bandlimited representation, we assume that contours are
elements of the subspace $S_K = \mathrm{span}(\bar S_0, \bar S_1, \ldots, \bar S_{K-1})$;
the degree of smoothness is enforced by choosing K, which determines the maximum
frequency content of contours according to (11).

Like the B-spline basis, the Sinc-type basis function (10) also exhibits local
control: the energy of $S_i(t)$ is concentrated around $t = iT_s$.

2.3 Fourier Basis

The Fourier orthonormal basis is, probably, the representation most often used
for periodic functions. In this representation the basis elements are given by

$$\varphi_n(t) = e^{j 2\pi n t / T}, \qquad n \in \mathbb{Z}. \tag{13}$$

With the Fourier representation, the most natural way of imposing smoothness
is to restrict $L^2(T)$ to the finite subspace $S_K$ generated by $\varphi_n$ for
$n = -K+1, \ldots, 0, \ldots, K-1$. The generated contours are then given by

$$c(t) = \sum_{n=-K+1}^{K-1} a_n \varphi_n(t). \tag{14}$$

Subspace $S_K$ is obtained by filtering $L^2(T)$ elements with an ideal low-pass
filter of cutoff frequency $K/T$. As in the Sinc-type basis, the smoothness degree
is enforced by selecting the maximum frequency content of the contour. Contrary to
the B-spline and Sinc-type representations, the Fourier basis does not exhibit
local control.

2.4 Contour Sampling and Fitting 

Due to the discrete nature of digital images, one frequently faces the problem of
finding, in a given subspace $S_K$, the element closest to a set of discrete points.
In other words, given $c = [c(t_0), c(t_1), \ldots, c(t_{N-1})]^T$, find
$\hat c \in \mathbb{C}^N$ such that

$$\hat c = \arg\min_{w(t) \in S_K} \|c - w\|, \tag{15}$$

where $w = [w(t_0), \ldots, w(t_{N-1})]^T$.

In the fitting problem at hand, the set $\{t_i\}$ and the period T are not known.
Herein we take $T = N$ and $t_i = i$, for $i = 0, 1, \ldots, N-1$, which is termed
the uniform assignment strategy.

Define matrix B such that

$$[B]_{ij} = \varphi_j(t_i), \qquad i = 0, 1, \ldots, N-1, \quad j = 0, 1, \ldots, K-1, \tag{16}$$

where $\varphi_j$ is one of the basis functions (8), (12), or (13).

In terms of matrix B, minimization (15) is written as

$$\hat c = \arg\min_{w \in \mathcal{R}(B)} \|c - w\|, \tag{17}$$

where $\mathcal{R}(B)$ stands for the span generated by the columns of B (notation
$B_k$, when used, stresses that $k = \dim(\mathrm{span}(B_k))$). Using the Euclidean
norm, and assuming that $K < N$, the projection (17) is given by

$$\hat c = B B^\# c, \tag{18}$$

with

$$B^\# = (B^H B)^{-1} B^H \tag{19}$$

being the pseudoinverse matrix of B [21]. Matrix $B^\#$ also solves the following
problem:

$$\hat a = \arg\min_a \|c - Ba\| \tag{20}$$
$$\phantom{\hat a} = B^\# c, \tag{21}$$

$\hat c$ being, therefore, also given by $\hat c = B \hat a$.
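The projection (18)-(21) can be sketched for the Fourier basis as below. This is a minimal illustration under stated assumptions: the function names, the NumPy dependency, the test contour, and the noise level are all chosen here for the example. The least-squares solver `np.linalg.lstsq` is used instead of forming the pseudoinverse explicitly; both give the same minimizer when B has full column rank.

```python
import numpy as np

def fourier_basis_matrix(N, K):
    """Matrix B of (16) for the Fourier basis, with T = N and t_i = i.

    Columns hold exp(j 2 pi n t / T) for n = -K+1, ..., K-1, as in (14).
    """
    t = np.arange(N)[:, None]
    n = np.arange(-K + 1, K)[None, :]
    return np.exp(2j * np.pi * n * t / N)

def project(c, B):
    """Least-squares projection: a_hat solves (20), c_hat = B a_hat as in (18)."""
    a_hat, *_ = np.linalg.lstsq(B, c, rcond=None)
    return B @ a_hat, a_hat

# Hypothetical example: a smooth complex contour plus complex white noise.
rng = np.random.default_rng(0)
N = 128
t = np.arange(N)
clean = np.exp(2j * np.pi * t / N) + 0.3 * np.exp(-2j * np.pi * 3 * t / N)
noisy = clean + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

B = fourier_basis_matrix(N, K=5)
c_hat, a_hat = project(noisy, B)
```

Because the projection is orthogonal and the clean contour already lies in the subspace, the projected contour is never farther from the clean contour than the noisy one is.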




Fig. 1. (a) Projection of a noisy contour onto the subspace generated by the Fourier
basis. Stars represent the discrete contour to be projected, whereas dotted and
solid lines represent the projection onto $\mathcal{R}(B_{10})$ and
$\mathcal{R}(B_{30})$, respectively. (b) Representation error for B-spline,
Sinc-type, and Fourier bases.



Fig. 1(a) shows the projection of a hand-traced contour, contaminated with
white noise, onto the subspace generated by the Fourier basis. A complex zero-mean
Gaussian random variable with standard deviation of 0.1 was added to
each coordinate; stars represent the discrete contour to be projected; dotted and
solid lines represent the projections onto $\mathcal{R}(B_{10})$ and
$\mathcal{R}(B_{30})$, respectively. Projection onto $\mathcal{R}(B_{10})$ is clearly
underfitted, while projection onto $\mathcal{R}(B_{30})$ is nearly optimum. This can
be perceived from the projection error plotted in Fig. 1(b). The minimum error
occurs, for the three representations, roughly at K = 30. For large values of K
the representation error increases, as the respective subspaces are now unable to
smooth out the high frequency components of the noise.

The similarity between the three representations, at least for the example
presented, is evident. However, we would like to call attention to the following
point: the representation error on the subspaces generated by Sinc-type and
Fourier bases decreases until it reaches a minimum. This is not the case with
the B-spline basis: the representation error, in this latter case, may increase,
although slightly, with K. This behavior results from the non-nested structure of
the subspaces generated by the B-splines, whereas the subspaces generated by
Sinc-type and Fourier bases are nested: the linear space of F-bandlimited functions
contains all subspaces of W-bandlimited functions with W < F.

A nested subspace structure may be a desirable feature when the space dimension
is unknown and has to be estimated.

Fig. 2. Contour displacements computed along orthogonal lines. Crosses show
the maximum of the loglikelihood function along each orthogonal line. The dotted
line denotes the initial contour.

3 Image Generation Model 



Let $c = [c_0, \ldots, c_{N-1}]^T$ be the boundary of a connected region $R_1$ of
the plane and $R_2$ the set of points not in $R_1$. Denote $x_i$ as the image
gray-level observed at the i-th pixel, $x = \{x_i\}$ as the set of image gray-levels,
$p_x$ as the gray-level density, and $\psi_x = \{\psi_1, \psi_2\}$ as the density
parameters (i.e., $p_x(x_i) = p_x(x_i|\psi_1)$ for $i \in R_1$ and
$p_x(x_i) = p_x(x_i|\psi_2)$ for $i \in R_2$). Since we take as hypothesis that the
image random variables, conditioned on the contour, are independent, it follows
that

$$p_{x|c}(x|c, \psi_x) = \left( \prod_{i \in R_1} p_x(x_i|\psi_1) \right) \left( \prod_{i \in R_2} p_x(x_i|\psi_2) \right). \tag{22}$$

According to the proposed approach, contour c belongs to the subspace
$\mathcal{R}(B_K)$, being therefore given by $c = B_K a$, for $a \in \mathbb{C}^K$.
Subscript K will occasionally be omitted.
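As a rough illustration of the image model (22), the sketch below evaluates its logarithm for Gaussian gray-level densities. The Gaussian choice, the boolean mask standing in for the region enclosed by the contour, the function names, and the synthetic image are assumptions made for this example only.

```python
import numpy as np

def gaussian_loglik(x, mu, sigma):
    """Sum of log N(x_i; mu, sigma^2) over the pixels in x."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                        - (x - mu) ** 2 / (2.0 * sigma ** 2)))

def region_loglik(image, inside, psi1, psi2):
    """Logarithm of (22): pixels with inside == True follow psi1, the rest psi2."""
    return (gaussian_loglik(image[inside], *psi1)
            + gaussian_loglik(image[~inside], *psi2))

# Hypothetical synthetic image: a disc (region R1) on a background (region R2).
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:64, 0:64]
inside = (xx - 32) ** 2 + (yy - 32) ** 2 < 15 ** 2
psi1, psi2 = (80.0, 15.0), (160.0, 30.0)  # (mean, standard deviation)
image = np.where(inside,
                 rng.normal(*psi1, size=inside.shape),
                 rng.normal(*psi2, size=inside.shape))

shifted = (xx - 20) ** 2 + (yy - 20) ** 2 < 15 ** 2  # a deliberately wrong region
ll_true = region_loglik(image, inside, psi1, psi2)
ll_shifted = region_loglik(image, shifted, psi1, psi2)
```

Comparing `ll_true` with `ll_shifted` shows how the loglikelihood rewards a region hypothesis that matches the true boundary.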



3.1 Bayesian Approach to Contour Estimation 

In accordance with the rationale already exposed, we assume that contours
$c(K) = c(K, a)$ are random vectors with probability density function given
by

$$p_c(c(k)) = p_K(k|\psi_c), \tag{23}$$

where $\psi_c$ denotes a parameter vector of $p_K$. Hence, the MAP estimate of the
pair $(c, K)$ is

$$(\hat c, \hat K) = \arg\max_{k,\; c \in \mathcal{R}(B_k)} p_{x|c}(x|c, \psi_x)\, p_K(k|\psi_c). \tag{24}$$

3.2 Known Space Dimension

Consider now that K is known. The MAP estimate (24) is, under this condition,
simply the maximum likelihood (ML) contour estimate given by

$$\hat c = \arg\max_{c \in \mathcal{R}(B_k)} L_{x|c}(x|c, \psi_x), \tag{25}$$

where $L_{x|c}(x|c, \psi_x) = \log p_{x|c}(x|c, \psi_x)$ is the loglikelihood function.

To compute $\hat c$, we implement an ascent-type iterative algorithm that, in the
t-th iteration, performs the following steps:

1. determine, in the unconstrained space $\mathbb{C}^N$, a contour displacement
$\Delta c^{(t)}$ that increases $L_{x|c}(x|c, \psi_x)$;

2. project $c^{(t)} + \Delta c^{(t)}$ onto the constrained subspace
$\mathcal{R}(B)$, thus obtaining $c^{(t+1)}$.

The displacement $\Delta c^{(t)}$ is computed along orthogonal lines, as schematized
in Fig. 2. Underlying this choice is the fact that the gradient of $L_{x|c}(x|c)$,
computed with respect to c, is orthogonal to the tangent vector $dc/dt$ [23].

To prevent contours from self-intersecting, the orthogonal lines should not be
too long. Work [23] proposes a technique for selecting long-range orthogonal
curves that do not intersect each other. Herein, however, we do not follow the
mentioned technique, since it is not suited to our setting.

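When the vector $\psi_x$ is not known, we determine the ML estimate of the vector
$(c, \psi_x)$ according to

$$(\hat c, \hat\psi_x) = \arg\max_{c \in \mathcal{R}(B_k),\; \psi_x} L_{x|c}(x|c, \psi_x). \tag{26}$$

To compute $(\hat c, \hat\psi_x)$ given by (26), the following iterative scheme is
implemented:

Initialization: set $c^{(0)}$, $\psi_x^{(0)}$

DO

step 1: $\Delta c^{(t)} = \arg\max_{u \in O(c^{(t)})} L_{x|c}(x|c^{(t)} + u, \psi_x^{(t)})$,

where $O(c) \subset \mathbb{C}^N$ is the set of points defining
orthogonal displacements to the contour c

step 2: $c^{(t+1)} = B B^\# (c^{(t)} + \Delta c^{(t)})$

step 3: $\psi_x^{(t+1)} = \arg\max_{\psi_x} L_{x|c}(x|c^{(t+1)}, \psi_x)$

step 4: $\Delta L = L_{x|c}(x|c^{(t+1)}, \psi_x^{(t+1)}) - L_{x|c}(x|c^{(t)}, \psi_x^{(t)})$

WHILE $|\Delta L| > \delta$.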



Vector $\psi_x^{(t+1)}$ can be written in terms of regions $R_1$ and $R_2$ as

$$\psi_1^{(t+1)} = \arg\max_{\psi_1} \sum_{i \in R_1} \log p_x(x_i|\psi_1), \tag{27}$$
$$\psi_2^{(t+1)} = \arg\max_{\psi_2} \sum_{i \in R_2} \log p_x(x_i|\psi_2). \tag{28}$$

Fig. 3. Sequence of contour estimates produced by the proposed technique. The
inner square represents the initial contour.



Expressions (27) and (28) depend on the particular structure of $p_x$. For
example, for Gaussian densities with mean $\mu$ and variance $\sigma^2$, estimates
of $\psi_i = (\mu_i, \sigma_i^2)$, for i = 1, 2, are given by the sample mean and
sample variance within the respective region.
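For the Gaussian case just described, the parameter update reduces to sample statistics per region. The sketch below assumes that the current contour has already been rasterized into a boolean mask (a step not shown here); the function name and the toy image are hypothetical.

```python
import numpy as np

def ml_gaussian_parameters(image, inside):
    """ML estimates (mean, variance) for the regions inside and outside the mask.

    The ML variance divides by the number of pixels, which is NumPy's default.
    """
    estimates = []
    for region in (image[inside], image[~inside]):
        estimates.append((float(region.mean()), float(region.var())))
    return estimates

# Tiny deterministic example (illustrative values only).
image = np.array([[1.0, 3.0],
                  [10.0, 10.0]])
inside = np.array([[True, True],
                   [False, False]])
psi1_hat, psi2_hat = ml_gaussian_parameters(image, inside)
```

In an iteration of the scheme above, this update would be applied after each projection of the displaced contour, using the mask of the newly estimated region.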

Fig. 2, parts (a) and (b), displays estimates and respectively, of a 
Gaussian image with parameters $\{\mu_1 = 60, \sigma_1 = 15\}$ and $\{\mu_2 = 160, \sigma_2 = 30\}$.
The boundary is obtained from a hand traced contour followed by projection 
onto 

Fig. 3 shows a sequence of contour estimates produced by the proposed
algorithm. The long-range nature of the external forces pulls the contour outwards,
as if it were under an expansion force.

Fig. 4, parts (a) and (b), displays two final estimates for Gaussian images with
parameters $\psi_x^{(a)} = \{(\mu_1 = 80, \sigma_1 = 15), (\mu_2 = 160, \sigma_2 = 30)\}$
and $\psi_x^{(b)} = \{(\mu_1 = 100, \sigma_1 = 15), (\mu_2 = 100, \sigma_2 = 30)\}$.
The estimated contours are nearly the true ones, even for image (b), which exhibits
no contrast at all (i.e., $\mu_1 = \mu_2$).

3.3 Unknown Space Dimension 

Consider now that the space dimension is unknown and, consequently, is also
to be estimated jointly with the contour. Noting that $c = c(K)$, and according
to expression (24), the MAP estimate of the space dimension is given by

$$\hat K = \arg\max_k \left\{ L_K(k|\psi_c) + \max_{c \in \mathcal{R}(B_k)} L_{x|c}(x|c, \psi_x) \right\}, \tag{29}$$

where $L_K(k|\psi_c) = \log p_K(k|\psi_c)$. The estimated contour is the ML solution
studied in the previous section, with the space dimension set to $\hat K$.

Fig. 4. Illustration of performance at very low contrast: gray-levels in image (b)
have the same mean value in both regions; in spite of this, the estimated contour
is identical to the one estimated from image (a), which has high contrast.




Fig. 5. Behavior of the loglikelihood function for two different parameter vectors.



As in any Bayesian approach, the prior term must be specified. The first
thought that could come to mind is to assume that $p_K(k|\psi_c)$ is uniformly
distributed for $K_{min} \le K \le K_{max}$; the estimate of $(c, K)$ would
therefore be interpretable as an ML estimate. Unfortunately, this attempt would
fail. The reason is the following: due to the nested nature of the parameter spaces
($\mathcal{R}(B_k) \subset \mathcal{R}(B_{k+1})$), the maximized loglikelihood
function $L_{x|c}(x|\hat c(k), \psi_x)$ is a monotonically increasing (or at least
nondecreasing) function of K, so it reaches its maximum at $K_{max}$.
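The effect of nesting can be illustrated with a simplified stand-in for the loglikelihood. The sketch below assumes, purely for illustration, that fit quality is measured by the squared residual of a least-squares projection onto nested Fourier subspaces; this is not the region-based likelihood used in this paper, and the test contour and noise level are hypothetical. Since each subspace contains the previous one, the residual can never increase with K, so a criterion based on fit alone always favors the largest dimension.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
t = np.arange(N)
# Hypothetical noisy contour: one harmonic plus complex white noise.
noisy = (np.exp(2j * np.pi * t / N)
         + 0.2 * (rng.standard_normal(N) + 1j * rng.standard_normal(N)))

def residual(c, K):
    """Squared least-squares residual of c on the Fourier subspace of order K."""
    n = np.arange(-K + 1, K)
    B = np.exp(2j * np.pi * np.outer(t, n) / N)
    a, *_ = np.linalg.lstsq(B, c, rcond=None)
    return float(np.linalg.norm(c - B @ a) ** 2)

# Residuals for K = 1, ..., 10; the sequence is nonincreasing by construction.
residuals = [residual(noisy, K) for K in range(1, 11)]
```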

The problem of choosing the order of competing models of different dimensions
is termed a model order selection problem. Among the approaches that
have been suggested for this problem, the Akaike information criterion (AIC)
[1] and the minimum description length (MDL) [20] have gained popularity.
Work [11], also on contour estimation, applies the MDL principle to derive the
term $L_K(k|\psi_c)$ of (29).

Fig. 6. Loglikelihood and prior behavior for the images shown in Fig. 7.



In this work we propose the prior 

PKim,) = a > 0, (30) 

where Z is a normalizing constant, and a is given by 

71 

a = (31) 

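where $n_c$ is the number of contour pixels, $\mu \simeq 0.1$, and

$$D(\psi_1, \psi_2) = D(\psi_1 \| \psi_2) + D(\psi_2 \| \psi_1) \tag{32}$$

is the symmetric Kullback distance [18], with $D(\psi_1 \| \psi_2)$ the Kullback
distance [18] between densities $p_x(x_i|\psi_1)$ and $p_x(x_i|\psi_2)$, given by

$$D(\psi_1 \| \psi_2) = \int p_x(x|\psi_1) \log \frac{p_x(x|\psi_1)}{p_x(x|\psi_2)}\, dx. \tag{33}$$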

The derivation of prior (30) is outside the scope of this paper. We present,
however, an informal justification. To this end, define

$$L_{c|x}(k|x, \psi) = L_K(k|\psi_c) + \max_{c \in \mathcal{R}(B_k)} L_{x|c}(x|c, \psi_x). \tag{34}$$

Define also the sets $A_{12}$ and $A_{21}$ containing the indexes of wrongly
classified pixels: in the first case region 1 has been detected, whereas in the
second case region 2 has been detected. Assume that the true space dimension is
$k_0$ and introduce

$$\Delta L_{c|x}(k) = L_{c|x}(k|x, \psi) - L_{c|x}(k_0|x, \psi). \tag{35}$$

The difference $\Delta L_{c|x}(k)$ can be written in terms of $A_{12}(k)$ and
$A_{21}(k)$ as

$$\Delta L_{c|x}(k) = \sum_{i \in A_{12}(k)} \log \frac{p_x(x_i|\psi_1)}{p_x(x_i|\psi_2)} + \sum_{i \in A_{21}(k)} \log \frac{p_x(x_i|\psi_2)}{p_x(x_i|\psi_1)} + \Delta L_K(k), \tag{36}$$

where

$$\Delta L_K(k) = L_K(k|\psi_c) - L_K(k_0|\psi_c). \tag{37}$$

A lengthy manipulation of (36), and a few weak assumptions, lead to

$$E\{\Delta L_{c|x}(k) \,|\, k_0\} = -n_{12}(k)\, D(\psi_1, \psi_2) + \Delta L_K(k), \tag{38}$$

where $n_{ij}(k) = \# A_{ij}(k)$ is the number of elements of $A_{ij}$.

The interpretation of (38) is clear: the term $n_{12}(k) D(\psi_1, \psi_2)$ tends to
zero as the number of misclassified pixels tends to zero. The vanishing rate is
proportional to the symmetric Kullback distance $D(\psi_1, \psi_2)$.

Fig. 5 schematizes the behavior of the loglikelihood function $L_{x|c}(x|c, \psi_x)$
and of the prior term $L_K(k|\psi_c)$, for two parameter vectors $\psi_x^{1}$ and
$\psi_x^{2}$. When $n_{ij}(k)$ approaches zero, $L_{x|c}(x|c, \psi_x^{1})$ approaches
a constant. By adding an appropriate prior term $L_K(k|\psi_c^{1})$ to the
loglikelihood function, a maximum is obtained at $k = k_0$. For the second
parameter vector $\psi_x^{2}$, the rate of increase of $L_{x|c}(x|c, \psi_x^{2})$ is
slower than that of $L_{x|c}(x|c, \psi_x^{1})$. If the prior term
$L_K(k|\psi_c^{1})$ were used, a maximum would be obtained for $k < k_0$.

Fig. 7. Magnetic resonance images: (a) heart and (b,c) brain.



Fig. 6 displays the loglikelihood and prior behavior for the images shown in Fig. 4:
the left column plots data from part (b), while the right column plots data from
part (a). For both cases the maximum of $L_{c|x}(k|x, \psi)$ is obtained for k = 7.
However, the loglikelihood function $L_{x|c}(x|c, \psi_x^{(b)})$ grows much more
slowly than $L_{x|c}(x|c, \psi_x^{(a)})$, approximately by a factor of 12, determined
in the interval $k \in \{2, 3, 4, 5\}$. For the maximizer k to be the same, the prior
term $L_K(k|\psi_c^{(b)})$ must grow more slowly than $L_K(k|\psi_c^{(a)})$ by the
same factor. This is, to a good approximation, what happens.

The symmetric Kullback distance, for Gaussian distributions, is given by

$$D(\psi_1, \psi_2) = \frac{(\sigma_1^2 - \sigma_2^2)^2 + (\mu_1 - \mu_2)^2 (\sigma_1^2 + \sigma_2^2)}{2 \sigma_1^2 \sigma_2^2}. \tag{39}$$

Noting that the parameters associated with the images displayed in Fig. 6 are
$\psi_x^{(a)} = \{(\mu_1 = 80, \sigma_1 = 15), (\mu_2 = 160, \sigma_2 = 30)\}$ and
$\psi_x^{(b)} = \{(\mu_1 = 100, \sigma_1 = 15), (\mu_2 = 100, \sigma_2 = 30)\}$, it
follows that the ratio $D(\psi_1^{(a)}, \psi_2^{(a)}) / D(\psi_1^{(b)}, \psi_2^{(b)})$ is
in accordance with the experimental results.
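The symmetric Kullback distance between two univariate Gaussians can be checked numerically with the short sketch below. It assumes the standard closed form of the directed Kullback distance between Gaussians and sums the two directions as in (32); the function names are hypothetical, and the two parameter sets are the ones quoted in the text for images (a) and (b).

```python
import math

def kl_gaussian(mu1, s1, mu2, s2):
    """Directed Kullback distance D(psi1 || psi2); s1, s2 are standard deviations."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5

def symmetric_kl_gaussian(mu1, s1, mu2, s2):
    """Sum of both directed distances, written as a single fraction."""
    v1, v2 = s1 ** 2, s2 ** 2
    return ((v1 - v2) ** 2 + (mu1 - mu2) ** 2 * (v1 + v2)) / (2.0 * v1 * v2)

# Parameter sets quoted in the text for images (a) and (b).
d_a = symmetric_kl_gaussian(80.0, 15.0, 160.0, 30.0)
d_b = symmetric_kl_gaussian(100.0, 15.0, 100.0, 30.0)
ratio = d_a / d_b
```

The logarithmic terms of the two directed distances cancel in the sum, which is why the symmetric distance has a purely rational closed form.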

Fig. 7 shows estimated contours over real magnetic resonance images: (a)
heart and (b,c) brain. The Gaussian model and the Fourier basis were used. The
estimated space dimensions are 4, 6, and 7, which are in agreement with the
frequency content of the contours.

We stress the robustness of the methodology with respect to image
nonhomogeneities and poor contour initializations.



4 Concluding Remarks 

This paper introduced a novel adaptive methodology for contour estimation from
noisy images. The approach was Bayesian: images were modeled as a set of
homogeneous regions, in a statistical sense. Contours were assumed to be vectors
of a subspace generated by a finite basis: B-spline, Fourier, and Sinc-type bases
were studied. It was concluded that Fourier and Sinc-type bases were better
suited to the proposed technique due to their nesting property.

A relevant contribution of the paper was on the contour prior design. By 
parametrizing the a priori probability density function with the symmetric Kull- 
back distance between densities of each homogeneous region, the proposed algo- 
rithm produces meaningful estimates. 

The proposed scheme is completely adaptive; all model parameters are
estimated jointly with the contour. Results obtained with simulated and real data
show the adequacy of the proposed approach.

Abstract. It is of practical importance to fuse data obtained by multiple
sensors to improve the performance of computer vision systems.

This paper introduces an algorithm for pixel-based data fusion within the
variational framework. An adaptive system fuses data effectively using
a variational technique. Previously, we introduced a technique that
fuses gray-scale image and texture features for segmenting an
image with both textured and non-textured surfaces. This paper extends
the study to more general multi-valued data and improves the previous
algorithm in terms of performance and speed.

Keywords: segmentation, data fusion, color, texture, variational method 

1 Introduction 

There has been an immense amount of interest in formulating various computer
vision problems in the energy minimization framework [1,7]. One such problem
is segmentation or boundary detection^1. Models such as the weak membrane
model (WMM) and the Mumford-Shah model have gained much attention recently
for segmentation under noisy circumstances [2,17,18]. These models combine the
often conflicting terms of data fidelity and data smoothness with a binary process
that introduces discontinuities in the smoothness measure.

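With this method, the objective becomes finding the global minimum of an
energy function such as

$$E(s, l) = \iint \left\{ \alpha \|s - g\|^2 + \mu \|\nabla s\|^2 (1 - l) + \nu l \right\} dx\, dy, \tag{1}$$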



where $\alpha$, $\mu$ and $\nu$ are weight parameters and $\|\cdot\|$ is the
L2-norm. The function $g(x, y)$ represents data collected by a sensor at the
coordinate $(x, y)$ and is often referred to as the observation; $s(x, y)$ is called
the surface process and represents a smooth surface constrained by the observation;
$l(x, y)$ is a binary process called the line process. The line process represents
the presence or absence of a boundary at $(x, y)$ with $l = 1$ or $l = 0$,
respectively. The first term is called the fidelity, the second term the stabilizer,
and the third term the penalty.

* Research partially supported by ONR Grant N00014-97-1-1163 and ARO Grant
DAAH04-96-10326

^1 In this paper these two terms are considered to be interchangeable, although some
claim that they should be treated differently as the former is region based and the
latter is edge based.

Through minimization of the energy function, the surface process maintains
closeness to the observation, and develops into a smooth surface within a boundary
where $l = 0$. The surface process is allowed to be discontinuous at boundaries
where $l = 1$. The penalty term penalizes having $l = 1$ so that it prevents $l$
from being 1 everywhere. As $\nu$ increases, the number of boundary pixels in the
result decreases.

The purpose of this paper is to extend this well-known model for multi-valued 
data where the data can be obtained from multiple sensors or from multiple 
feature extractors. We assume that each sensor or feature extractor is aligned so 
that no registration is required on the obtained data. This type of data fusion
problem is often referred to as pixel-based fusion [16]. Thus, each pixel has
a corresponding vector representation where each vector element is the value
measured by each sensor or extracted by each feature extractor. Throughout the
paper, each vector element is referred to simply as a feature.

A key to a good boundary detection process is to detect discontinuities in
a spatially laid out data field. Each sensor or feature extractor is designed to
capture only certain surface properties, and thus can only detect certain surface
boundaries. In order to build a reliable automated boundary detection system
for general environments, it is important to have a mechanism of selecting a 
subset of sensors adaptively from region to region. 

Previously, we have reported a fusion technique for boundary detection using
gray-scale pixels and texture features [10]. The technique combines the features
by a set of normalized weights and minimizes an energy term similar to the WMM
with respect to the weights to obtain a near-optimal set of weights at each feature
location. This paper gives another interpretation of the algorithm and proposes
a new computational technique. It also presents more extensive segmentation
experiments on both synthetic and natural color images. The new technique 
is computationally cheaper than the previous one and gives better boundary 
representations in various experiments we have performed. 

The paper is organized as follows. Section 2 describes our algorithm. It pro- 
poses an energy model based on the WMM and two different computational 
approaches for the minimization process. Section 3 provides various experimen- 
tal results for color image segmentation. Section 4 provides experimental results 
for texture segmentation. Section 5 gives a brief summary and conclusions.

2 Algorithm 

2.1 Model I 

The first model is the one reported in [10]. The main idea is to combine multiple
features by normalized weights and to minimize the energy function (1) with
respect to the weights to obtain a set of near-optimal weights. The energy is also
minimized with respect to $s$ and $l$ using some conventional method.

The rationale of this approach is to trust smooth data while disregarding rough
data, assuming that the measurements are smooth within region boundaries. The
quantity $\lambda$, or the amount of trust in the data, is measured locally,
resulting in spatially variant adaptation of the system.

The idea of estimating parameters by minimizing an objective function is
not new and has been studied from different perspectives [3,20,14]. In [20], the
idea was applied to obtain a set of compatibility coefficients for relaxation
labeling problems. Region based segmentation algorithms often employ an
Expectation-Maximization (EM) strategy where segmentation (Expectation) and
Maximum Likelihood parameter estimation (Maximization) are performed iteratively
until convergence [3,11]. Some heuristic approaches have been proposed as well. In
[13], the WMM incorporates Gabor features where the features are arranged in the
Gabor 4D space (x, y, scale and orientation). An appropriate mixture is computed
through diffusion in the 4D feature space. Our interest is to apply the idea
of relaxation labeling to the WMM for boundary detection on a multi-valued data
field.

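The weights are denoted as $\{\lambda_i\}$ and satisfy the following constraints:

$$\sum_i \lambda_i = 1, \qquad \lambda_i \ge 0. \tag{2}$$

Then the weighted features and surface processes are

$$\tilde g_i = \lambda_i g_i, \tag{3}$$
$$\tilde s_i = \lambda_i s_i. \tag{4}$$

The total energy of the model is

$$E = \sum_i \left\{ \|\tilde s_i - \tilde g_i\|^2 + \mu_1 \|\nabla \tilde s_i\|^2 (1 - l) \right\} + \nu l. \tag{5}$$

If we assume

$$\nabla \tilde s_i = \lambda_i \nabla s_i, \tag{6}$$

then

$$E = \sum_i \left\{ \lambda_i^2 \|s_i - g_i\|^2 + \lambda_i^2 \mu_1 \|\nabla s_i\|^2 (1 - l) \right\} + \nu l. \tag{7}$$

In order to ensure that assumption (6) is valid, an additional constraint

$$\|\nabla \lambda_i\|^2 (1 - l) = 0 \tag{8}$$

is added.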

It is desirable for the solution of the minimization process to be invariant to
redundant features, which can be either duplicates of already included features
or features with no useful information such as white noise. A problem with the
above model is that the fidelity and stabilizer terms depend on the number of
features, since $1/M \le \sum_i \lambda_i^2 \le 1$. Thus adding redundant features
can reduce the influence of the fidelity and stabilizer terms relative to the
penalty. To achieve the invariance, $\nu$ can be made dependent on $\{\lambda_i\}$
as well by

U = (9) 

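Now the objective is to minimize

$$E = \sum_i \left\{ \lambda_i^2 \|s_i - g_i\|^2 + \lambda_i^2 \mu_1 \|\nabla s_i\|^2 (1 - l) \right\} + \nu l \tag{10}$$

with

$$\sum_i \lambda_i = 1, \qquad \lambda_i \ge 0, \qquad \|\nabla \lambda_i\|^2 (1 - l) = 0. \tag{11}$$

Solving the Euler-Lagrange equation gives the update rule for $s_i$ as

$$\frac{\partial s_i}{\partial t} \propto -\lambda_i^2 (s_i - g_i) + \nabla \cdot \left( \mu_1 \lambda_i^2 (1 - l) \nabla s_i \right). \tag{12}$$

The update rule for $l$ using mean field annealing is [6]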



SiU^)/T ^ ^ 

where T is the temperature for the annealing process. 

Initially, the $\lambda_i$ are all set to $1/M$. With some prior information
available, this initial condition can be biased according to the prior, which may
improve the system performance. The update process for $\lambda_i$ takes two steps.
First, $\lambda_i$ is updated individually by minimizing

$$\hat E = \sum_i \left\{ \lambda_i^2 \|s_i - g_i\|^2 + \lambda_i^2 \mu_1 \|\nabla s_i\|^2 (1 - l) + \mu_2 \|\nabla \lambda_i\|^2 (1 - l) \right\} + \nu l. \tag{14}$$

Its update rule is

$$\frac{\partial \lambda_i}{\partial t} \propto -\lambda_i \|s_i - g_i\|^2 - \mu_1 \lambda_i \|\nabla s_i\|^2 (1 - l) + \nabla \cdot \left( \mu_2 (1 - l) \nabla \lambda_i \right) - \lambda_i \nu l. \tag{15}$$

Second, $\{\lambda_i\}$ is normalized to meet the hard constraints of (2). In order
to assure positivity, any $\lambda_i < 0$ is set to 0, implying that $s_i$ is not
useful at the location. Then the $\lambda$s are normalized by

$$\lambda_i \leftarrow \frac{\lambda_i}{\sum_j \lambda_j} \tag{16}$$

to ensure they sum up to 1. Thus the whole process of updating $\lambda$ is similar
to probabilistic relaxation [21]. Special care has to be taken when all $\lambda$s
are 0. In that case, they are set back to $1/M$ as in the initial condition. However,
this condition never happened in our experiments, mainly because the descent rate
of $\lambda_i$ is proportional to $\lambda_i$, as seen in (15). As $\lambda_i$
approaches 0, the rate of its change decreases.
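The second step of the weight update (clipping, normalization by (16), and the reset to 1/M when all weights vanish) can be sketched as follows. The array layout, with one weight map per feature stacked along the first axis, and the function name are assumptions made for this example.

```python
import numpy as np

def normalize_weights(lam):
    """Project weight maps onto the constraints (2).

    lam has shape (M, H, W): one weight map per feature (assumed layout).
    """
    lam = np.maximum(lam, 0.0)                   # any negative weight is set to 0
    total = lam.sum(axis=0, keepdims=True)       # per-pixel sum over the M features
    all_zero = total == 0.0
    safe_total = np.where(all_zero, 1.0, total)  # avoid dividing by zero
    M = lam.shape[0]
    # Normalize as in (16); pixels where every weight is zero are reset to 1/M.
    return np.where(all_zero, 1.0 / M, lam / safe_total)

# Two pixels, three features (illustrative values only).
lam = np.array([[[0.2, -1.0]],
                [[-0.1, 0.0]],
                [[0.6, -2.0]]])
weights = normalize_weights(lam)
```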
2.2 Model II

It can be seen easily that the energy equation (10) can be rewritten as

$$E = \sum_i \lambda_i^2 E_i, \tag{17}$$

where $E_i$ is the energy quantity (1) associated with the i-th feature. The above
equation suggests that $\lambda$ can be interpreted as a normalized weight for E and
not for s. The significance of this interpretation is that the dynamics of $s_i$ is
no longer dependent on $\lambda_i$ and one can simply apply the pixel-based data
fusion at the energy level instead of the feature level. The constraint (8) is no
longer necessary for the formulation of (17). However, it enforces smoothness in
$\lambda$ and decreases the system's sensitivity to noise.

With this model, the update rule for $s_i$ is

$$\frac{\partial s_i}{\partial t} \propto -(s_i - g_i) + \nabla \cdot \left( \mu_1 (1 - l) \nabla s_i \right). \tag{18}$$

The update rules for $l$ and $\lambda_i$ are the same as (13) and (15), respectively.

Note that the difference between Model I and II is not the energy model but the
computation or minimization process. Both have the identical energy landscape,
as (10) and (17) are identical. Model I assigns a spatially variant diffusion speed
for $s_i$, where the speed is proportional to $\lambda_i^2$. On the other hand,
Model II allows each surface process, $s_i$, to diffuse at the maximum speed
independently of $\lambda_i$. Therefore, as will be demonstrated in the next
section, Model I does not diffuse areas with high data fluctuations when another
feature does not contain fluctuations in the region. These non-diffused areas will
form separate regions by themselves as the energy minimization process converges.

As far as the amount of computation is concerned, this model is simpler than
the first model since the update rule for $s_i$ is simpler. With our implementation,
Model II saved approximately 20% of computation time over Model I. It took
approximately 300 and 400 seconds of CPU time to run 256x256 RGB color
image segmentation on an SGI O2 system with Model I and II, respectively. The
same experiment took 220 seconds with WMM.

Model II can also be interpreted as the WMM energy with a modified feature
gradient strength. With this interpretation, the gradient strength of the feature
vector is computed as

$$\|\nabla s\|^2 = \sum_i \lambda_i^2 \|\nabla s_i\|^2. \tag{19}$$

Thus it can be considered as an adaptive gradient based segmentation technique.
Similarly, $\{\lambda_i^2\}$ can be considered as a row of a spatially variant linear
transformation matrix.

Another alternative for the energy model is



E = Y,\Ei. 



( 20 ) 
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This is a different energy model from (10). With /i 2 = 0 (i.e. without any spatial 
constraints on A^), (10) or (17) will yields 



Thus, the former allows continuous values where \i/\j = EijEj. The latter is 
a discrete system with a winner-take-all strategy. Both energy models produce 
very similar results in our experiments and (20) has a slightly less amount of 
computation compared to (17). The results presented in the following sections 
are produced by using the winner-take-all model (20). 

3 Color Image Segmentation 

The two algorithms described in the previous section are tested on RGB color 
images. 

Problems of segmenting color images have been studied from various differ- 
ent perspectives and many algorithms have been proposed in literature. They 
can be categorized into 1) histogram based 2) clustering based [8,23], and 3) 
Markov Random Field based techniques [15,22,19]. Techniques of the first class 
explicitly use histogram of each color component to separate foregrounds from 
the background. It is simple but does not work well on natural images. Tech- 
niques of the second class apply some clustering techniques in the transformed 
color space. Spatial information can be incorporated by adding pixel coordinates 
as features. Techniques of the third class incorporate both data proximity and 
spatial organization into Bayesian classification framework. Our scheme can be 
categorized into the third class as the WMM energy model can be linked to the 
Markov process through the Gibbs distribution [7]. 

Three test images were created for these experiments, each simulating a
different type of noise source. Figure 1 shows the test images. The columns in the
figure show the luminance, R-band, G-band and B-band of the RGB color images,
respectively. The first image simulates noisy sensors in disjoint regions. The signal
to noise ratio inside the defective regions is 0.425. The second image has noise
everywhere on one of the sensors (the Blue channel). It simulates channel noise,
or the case where one of the sensors is ineffective in capturing information about
the environment. The signal to noise ratio in the blue channel is 0.5. The third
image has noise everywhere on all channels. It simulates system noise or a noisy
environment. The signal to noise ratio of the image is 1.275. As seen in the
figure, the luminance by itself does not contain enough information to extract
boundaries reliably.

First, an input RGB image is normalized to make it zero mean and unit 
variance. This normalization allows the segmentation system to be independent 




(21)

as a solution. On the other hand, (20) will yield

$$\lambda_i = \begin{cases} 1 & \text{if } E_i = \min_j E_j \\ 0 & \text{otherwise} \end{cases} \qquad (22)$$
of the contrast of the input image. Figure 2 shows the results of applying boundary
detection without updating $\lambda$, so that each feature is treated equally. It shows
that the technique cannot extract the ground truth boundaries without picking
up spurious edges. It performed well on the third case, where all channels are
equally noisy everywhere. In such cases, no weight adaptation is necessary.

Figure 3 shows the results of applying Model I to the test images. The
technique could delineate the ground truth boundaries without picking up spurious
edges. For the first image, it delineated the noisy regions as separate regions.
This is acceptable since this “noise” could come from different surface
characteristics of the regions. The technique, however, did need some adjustment of its
parameters. The figure shows that $\nu = 0.03$ was too large for Image 2 but was right
for Image 3.

Figure 4 shows the results of applying Model II to the test images. It
delineated the ground truth boundaries in all test cases without any parameter
tweaking. All the results were obtained with $\alpha = 0.05$, $\mu_1 = 1.0$, $\mu_2 = 10.0$, and
$\nu = 0.01$.

The performance difference between Model I and II can be explained by
examining the evolution of the surface processes. In Model I, the descent rate of
$s_i$ is proportional to $\lambda_i$ as described in Section 2. Within a noisy region where
$\lambda_i$ is small, $s_i$ evolves very slowly. Eventually, this slow diffusion will allow the
line process to form a boundary around the noisy region as the temperature, $T$
in (13), decreases. The line process, however, does not form boundaries within
the noisy region since the effective gradient strength, $\lambda_i \|\nabla s_i\|^2$, is small due to
a small $\lambda_i$ inside the region. Figure 5 compares the final surface processes for
Model I and II.

We have also applied the algorithm to natural images. First, Figure 6 gives the
results of applying the WMM, Model I, and Model II energy minimization
processes to the input image. The value of $\mu_2$ is changed to 1.0 for this image, since
some regions such as the tail wing are smaller than the regions in the previous
experiments, and the smoothness constraint on $\lambda$ is relaxed accordingly. The results
are very similar, as the $\lambda$ image does not show much variation in this case. The reason for
this is that the three color bands in natural images tend to have similar
characteristics, and regions with high variation in one band tend to have high variation in
the other two bands as well. Note that a bigger $\mu_2$ will make the variation even
smaller.

The number of spurious edges within the terrain appears to be smaller for Model
I and II than for WMM. However, it is difficult to compare the quality of line
processes quantitatively for natural images.

Next, the same algorithm is applied to a microscopic image of pituitary cells.
Here, the objective is clearer: each cell needs to be separated from the
background. Due to a high level of noise, $\alpha$ in (1) is lowered to 0.25.

Although the result of Model II appears to be less noisy, there is no
significant improvement over the WMM.

4 Texture Segmentation

This section presents some experimental results of our algorithm for segmenting
images comprising both textured and non-textured regions. It is often the
case that segmentation algorithms developed specifically for textured images do
not precisely detect sharp boundaries between non-textured regions, while those
developed for non-textured images are too sensitive to texture edges and fail to
detect meaningful boundaries for textured regions.

The idea here is to combine the luminance value with features obtained by some
texture feature detectors using the energy model (17), so that the system can
achieve a good trade-off between precise localization of non-textured regions and
good classification of textured regions.

We have employed simple texture feature extractors, namely Laws texture
metrics, for our experiments [12]. The Laws texture filters are separable pairs of
1D filters. Each 1D filter can be one of the following 5 filters:

L5 = { 1  4  6  4  1}/16,

E5 = {-1 -2  0  2  1}/6,

S5 = {-1  0  2  0 -1}/4,

W5 = {-1  2  0 -2  1}/6,

R5 = { 1 -4  6 -4  1}/16.

We have used 3 pairs, L5-E5, E5-L5 and E5-E5, for our experiments. After the
original image is filtered by the 3 separable filter pairs to produce 3 feature
images, each feature is replaced by its absolute value and each feature image is
smoothed with a concentric Gaussian filter with its standard deviation equal to
2 pixels. Finally, each texture feature is scaled by 2 to match the contrast of the
original image.
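The filtering step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the experiments: the names `convolve_rows`, `transpose` and `laws_feature` are hypothetical, borders are zero-padded, the first filter of a pair is assumed to act vertically and the second horizontally, and the Gaussian smoothing step is omitted for brevity.

```python
# Two of the five 1D Laws filters listed above.
L5 = [v / 16 for v in (1, 4, 6, 4, 1)]
E5 = [v / 6 for v in (-1, -2, 0, 2, 1)]


def convolve_rows(image, kernel):
    """Correlate every row with a 1D kernel; pixels outside the image count as 0.

    The kernel is not flipped. For the antisymmetric E5 this only changes the
    sign of the response, which the later absolute value removes.
    """
    half = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = x + k - half
                if 0 <= xx < width:
                    acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def laws_feature(image, vertical, horizontal, gain=2.0):
    """Separable filtering, absolute value, then scaling by `gain`.

    Assumption: the Gaussian smoothing (sigma = 2 pixels) described in the
    text would be applied between the absolute value and the scaling; it is
    left out of this sketch.
    """
    tmp = convolve_rows(image, horizontal)
    tmp = transpose(convolve_rows(transpose(tmp), vertical))
    return [[gain * abs(v) for v in row] for row in tmp]


# A vertical step edge: the L5-E5 feature responds at the edge only.
step = [[0.0] * 4 + [1.0] * 5 for _ in range(9)]
feature = laws_feature(step, L5, E5)
```

Stacking the three features with the original pixel value would then give the 4-element feature vector at each pixel.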

These simple and crude texture features are combined with the original pixel
value to form a feature vector with 4 elements at each pixel location.

Figure 8 gives the results of WMM, Model I and II on two synthetic textured
images. In both test cases, WMM could not detect the ground-truth boundaries
without picking up spurious texture edges. Both Model I and II localized the boundaries
between textured and non-textured regions and between two non-textured regions. Due to
the simple texture feature extraction process, they could not reliably detect the
boundaries between textured regions. We are currently experimenting with various texture
features for a more reliable discrimination process. Some of the features under consideration
are Gabor features [9,4], Wavelet Fractal Signatures [5], and Fan filters [24].

5 Conclusion 

The paper described an energy model and its minimization techniques for
detecting region boundaries in a multi-valued data field. The energy model is a
modification of WMM in which a set of normalized weights, $\lambda$, participates in the
minimization process. Two minimization techniques (referred to as Model I and
II throughout the paper) were described.

The technique is effective and improves on the WMM when the noise
characteristics of a multi-band image differ across the bands. It produces
results very similar to those of the WMM when the noise characteristics are similar across
the bands, as demonstrated in our color image experiments.

By the same reasoning, the technique is effective on a feature vector image,
as each feature has a different spatial variation.

The weight adaptation technique described in this paper does not take the
signal level of each feature into account. For example, suppose one sensor measures
a strong and uniform response in one region while the other sensor produces a
zero-level signal in the same region. If both responses are uniform across the region (i.e.
$\nabla s = 0$), then the adaptation technique will weigh the two sensors equally. However,
intuitively, the sensor with the strong response should be trusted more than
the other sensor. Thus, our future work includes incorporating the measurement
level into the adaptation criteria.
[Figure 1 panels. Rows: Noise Free Image, Noisy Image 1, Noisy Image 2, Noisy Image 3. Columns: Luminance, R Band, G Band, B Band.]

Fig. 1. Synthetic color test images. 1st row: the original noise free 
image. 2nd row: disjoint noise image (SNR=0.425). 3rd row: noise every- 
where in Blue band (SNR=0.5). 4th row: noise everywhere in all bands 
(SNR=1.275). 1st column: RGB images converted into gray scale. 2nd 
column: Red band. 3rd column: Green band. 4th column: Blue band. 
[Figure 2 panels, labelled $\nu$ = 0.03, 0.01, 0.01, 0.05, 0.03, 0.03.]

Fig. 2. Results of boundary detection using WMM. 1st column: the
results for the noisy image 1 in Figure 1. 2nd column: the results for the
noisy image 2 in Figure 1. 3rd column: the results for the noisy image 3
in Figure 1. $\alpha = 0.05$ and $\mu_1 = 1.0$; $\nu$ needed to be adjusted as shown.




Fig. 3. Results of boundary detection using Model I. 1st column:
the results for the Image 1 in Figure 1. 2nd column: the results for the
Image 2 in Figure 1. 3rd column: the results for the Image 3 in Figure 1.
$\alpha = 0.05$, $\mu_1 = 1.0$ and $\mu_2 = 10.0$. $\nu = 0.01$ for the 1st row and $\nu = 0.03$
for the 2nd row.
(a) (b) (c)

Fig. 4. Results of boundary detection using Model II. 1st column:
the results for the noisy image 1 in Figure 1. 2nd column: the results
for the noisy image 2 in Figure 1. 3rd column: the results for the noisy
image 3 in Figure 1. All the results were obtained with the parameter
set {$\alpha = 0.05$, $\mu_1 = 1.0$, $\mu_2 = 10.0$, and $\nu = 0.01$}.




(a) (b) (c) 




(d) (e) (f) 



Fig. 5. Comparison of surface process results. The figure shows
intermediate states of the surface process. The top row shows the results of
applying Model I to the test images shown in Figure 1. The bottom row
shows the results of applying Model II to the same test images. The first,
second and third columns show the results of applying the algorithms to Image
1, Image 2 and Image 3 in Figure 1, respectively. Note that this figure
only shows the luminance of the RGB images.
Fig. 6. Results of boundary detection on a natural color image
(Part 1). Top-left: the luminance of the input color image. Top-right: the
line process of WMM. Middle-left: the line process of Model I. Middle-right:
the line process of Model II. Bottom-left: the luminance of the surface
process. Bottom-right: the luminance of the lambda process. All the
results are computed with {$\alpha = 0.05$, $\mu_1 = 1.0$, $\mu_2 = 1.0$, and $\nu = 0.01$}.

Fig. 7. Results of boundary detection on a natural color image
(Part 2). Top-left: the luminance of the input color image. Top-right: the
line process of WMM. Middle-left: the line process of Model I. Middle-right:
the line process of Model II. Bottom-left: the luminance of the surface
process. Bottom-right: the luminance of the lambda process. All the
results are computed with {$\alpha = 0.025$, $\mu_1 = 1.0$, $\mu_2 = 1.0$, and $\nu = 0.01$}.

Fig. 8. Results of boundary detection on textured image. Two sets of
results are shown here. Each set shows the original texture (top-left), the line
process using WMM (top-right), the line process using Model I (bottom-left), and
the line process using Model II (bottom-right). All the results are computed with
{$\alpha = 0.05$, $\mu_1 = 1.0$, $\mu_2 = 10.0$, and $\nu = 0.01$}.
Abstract. This paper develops a theory for the convergence rates of A*
algorithms for real-world vision problems, such as road tracking, which 
can be formulated in terms of maximizing a reward function derived us- 
ing Bayesian probability theory. Such problems are well suited to A* 
tree search and it can be shown that many algorithms proposed to solve 
them are special cases, or variants, of A*. Moreover, the Bayesian for- 
mulation naturally defines a probability distribution on the ensemble of 
problem instances, which we call the Bayesian Ensemble. We analyze 
the Bayesian ensemble, using techniques from information theory, and 
mathematically prove expected $O(N)$ convergence rates of inadmissible
A* algorithms. These rates depend on an “order parameter” which char- 
acterizes the difficulty of the problem. 



1 Introduction 

Recently, it has become apparent [25] that a class of real world vision problems, 
formulated as Bayesian inference [18], can be solved using A* algorithms. This 
class includes tasks such as the detection and tracking of paths in noise/clutter, 
see figure (1). In particular, it was shown [25] that many of the algorithms used to 
solve these tasks (see, for example, [20], [2], [10], [12], [11]) could be interpreted as 
special cases, or variants, of A* algorithms. Incidentally, a consequence of applying
A* to Bayesian problems is that the prior probabilities, an essential component 
of the Bayesian approach, can be used to make stronger heuristic predictions 
than in standard A*, see [12], [28], which can result in improved performance. 




Fig. 1. The difficulty of detecting the target path in clutter depends, by our 
theory [27], on the order parameter K. The larger K the less computation re- 
quired. Left, an easy detection task with K = 0.8647. Middle, a harder detection 
task with K = 0.2105. Right, an impossible task with K = -0.7272.

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 189-204, 1999. 

© Springer-Verlag Berlin Heidelberg 1999

An advantage of expressing algorithms in a uniform framework, such as A*,
is that it enables us to do theoretical and experimental comparisons between 
different algorithms to determine which ones are most effective. Moreover, one 
may hope to identify characteristics of the problem domain which determine the 
difficulty of the search tasks independently of the algorithm used. If so, then it 
may be possible to design optimally effective algorithms to solve the problems. 
These are the issues that we investigate in this paper. (See also, our related work 
[27]). 

Broadly speaking, there are two strategies for evaluating the effectiveness of 
algorithms. The first is the worst case analysis used in much of computer science 
[9]. The second involves determining the convergence rates on typical problem 
situations (i.e. those which typically occur). This form of analysis requires having 
a probability distribution on the ensemble of problem instances. Karp and Pearl 
provided a fascinating analysis of binary tree A* search using this approach (see 
Chp. 5 [21], [15]). We argue that this second approach is of more relevance to 
the problems we are concerned with and so we will study it in our paper. Inter- 
estingly, however, there are some recent studies showing that order parameters 
exist for NP-complete problems and that these problems can be easy to solve 
for certain values of the order parameters [4], [24]. The connection between this 
approach and our own is a topic for further research. 

We emphasize that the Bayesian formulation of our problems naturally gives
rise to a probability distribution on the ensemble of problem instances, which we
call the Bayesian Ensemble. This allows us to build on the foundations
established by Karp and Pearl [21] to obtain expected convergence rates. Technically,
our proofs involve adapting techniques from information theory, such as Sanov’s
theorem, which were developed to bound the probability of rare events occurring
[7].



In particular, we formulate the problem of detecting target curves in clutter 
to be one of Bayesian inference [18]. This requires searching through a tree of 
possible target paths, see figure (2) for an illustration of this search task. This 
assumes that statistical knowledge is determined for the target and the clutter, 
as described in section (2). Such statistical knowledge has often been used in 
computer vision for determining optimization criteria to be minimized and tech- 
niques have been developed to learn it from real data [29]. We want to go one 
step further and use this statistical knowledge to determine good search strate- 
gies. In particular, we can prove that for certain curve and boundary detection 
algorithms we will obtain expected A* convergence rates by examining a number 
of nodes which is linear in the size N of the problem. In addition, the expected 
sort time per node is constant [6] (note that this does not necessarily imply that 
the expected time for the problem is linear in N). Moreover, our analysis helps 
determine important characteristics of the problem, similar to order parameters 
in statistical physics, which quantify the difficulty of the problem. These order 
parameters determine the constants in the convergence rate formulae and also 
determine the expected errors. 
Fig. 2. A simulated road tracking problem where dark lines indicate strong edge
responses and dashed lines specify weak responses. The branching factor is three. 
The data was generated by stochastic sampling using a simplified version of the 
models analyzed in this paper. In this sample there is only one strong candidate 
for the best path (the continuous dark line) but chance fluctuations have created 
subpaths in the noise with strong edge responses. The A* algorithm must search 
this tree starting at the bottom node. 



As we will show, our convergence bounds become infinite at certain values of 
these order parameters. Is this an artifact of our proofs? Or is it a limitation of the 
A* search strategy? In related work [27], we prove instead that it corresponds 
to a fundamental difficulty with the problem. As proven in [27] similar order 
parameters characterize the difficulty of solving the problem independently of the 
algorithm employed. Moreover, at critical values of these order parameters there 
is a phase transition and the problem becomes insolvable. These fundamental 
bounds show that our proofs in this paper only break down as we enter the regime 
where the problem is unsolvable by any algorithm. The A* algorithm remains 
effective as we approach the critical value of the order parameters although, not 
surprisingly, the convergence rates get very slow. 

The first section (2) of this paper describes the probabilistic formulation of 
road tracking that we use to prove our results. Section (3) introduces Sanov’s 
theorem and illustrates how it can be applied to bound the probabilities of rare 
events. In section (4), we analyze the case of inadmissible heuristics (the case of 
admissible heuristics with pruning was analyzed in [26]). We conclude by placing 
this work in a larger context and summarizing recent extensions. 

2 Mathematical Formulation of Road Tracking 

Tracking curved objects in real images is an important practical problem in com- 
puter vision. We consider a specific formulation of the problem of road tracking 
from aerial images by Geman (D.) and Jedynak [12]. Their approach used a 
novel active search algorithm to track a road in an aerial photograph with
empirical convergence rates of $O(N)$ for roads of length $N$. Their algorithm is highly
effective for this application and is arguably the best currently available. In
previous work [25], we showed that Geman and Jedynak’s algorithm was a close
approximation to A*. Other search algorithms, such as Dijkstra’s algorithm and Dynamic
Programming, used in related visual search problems [20], [2], [10], [17], [11], [5]
can be shown to be special cases of A* [25].

Our approach assumes that both the intensity properties and the geometrical 
shapes of the target path (i.e. the edge contour) can be determined statistically. 
This path can be considered to be a set of elementary path segments joined 
together. We first consider the intensity properties along the edge and then the 
geometric properties. 

The image properties of segments lying on the path are assumed to differ, in
a statistical sense, from those off the path. More precisely, we can design a filter
$\phi(\cdot)$ with output $\{y_x = \phi(I(x))\}$ for a segment at point $x$ so that:

$$P(y_x) = P_{on}(y_x), \text{ if } x \text{ lies on the true path}$$
$$P(y_x) = P_{off}(y_x), \text{ if } x \text{ lies off the true path.} \qquad (1)$$

For example, we can think of the $\{y_x\}$ as being values of the edge strength
at point $x$ and $P_{on}, P_{off}$ being the probability distributions of the response of
$\phi(\cdot)$ on and off an edge. The set of possible values of the random variable $y_x$
is the alphabet with alphabet size $M$. See [12], [5] for examples of distributions for
$P_{on}, P_{off}$ used in computer vision applications.

We now consider the geometry of the target contour. We require the path
to be made up of connected segments $x_1, x_2, \ldots, x_N$. There will be a Markov
probability distribution $P_g(x_{i+1}|x_i)$ which specifies prior probabilistic knowledge
of the target. It is convenient, in terms of the graph search algorithms we will
use, to consider that each segment $x$ has a set of $Q$ possible continuations.
Following terminology from graph theory, we refer to $Q$ as the branching factor.
We will assume that the distribution $P_g$ depends only on the relative orientations
of $x_{i+1}$ and $x_i$. In other words, $P_g(x_{i+1}|x_i) = P_{\Delta g}(x_{i+1} - x_i)$. An important
special case is when the probability distribution is uniform for all branches (i.e.
$P_{\Delta g}(\Delta x) = U(\Delta x) = 1/Q \;\; \forall \Delta x$).

By standard Bayesian analysis, the optimal path $X^* = \{x_1^*, \ldots, x_N^*\}$
maximizes the sum of the log posterior:

$$E(X) = \sum_i \log \frac{P_{on}(y(x_i))}{P_{off}(y(x_i))} + \sum_i \log \frac{P_{\Delta g}(x_{i+1} - x_i)}{U(x_{i+1} - x_i)}, \qquad (2)$$

where the sum over $i$ is taken over all segments on the target. $U(x_{i+1} - x_i)$ is the
uniform distribution and its presence merely changes the log posterior $E(X)$ by
a constant value. It is included to make the form of the intensity and geometric
terms similar, which simplifies our later analysis.

We will refer to $E(X)$ as the reward of the path $X$, which is the sum of the
intensity rewards $\log \frac{P_{on}(y(x_i))}{P_{off}(y(x_i))}$ and the geometric rewards
$\log \frac{P_{\Delta g}(x_{i+1} - x_i)}{U(x_{i+1} - x_i)}$.
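As a concrete reading of equation (2), the following Python sketch evaluates the reward of one candidate path over a finite alphabet. The function name `path_reward`, the dictionary representation of the distributions and the toy numbers are assumptions made for illustration; they do not come from the paper.

```python
import math


def path_reward(responses, moves, p_on, p_off, p_geom, branching):
    """Reward E(X) of equation (2) for one candidate path.

    responses: filter outputs y(x_i) along the path (alphabet symbols)
    moves:     relative moves x_{i+1} - x_i between successive segments
    p_on, p_off: dicts mapping a symbol to its on-path / off-path probability
    p_geom:    dict mapping a relative move to its prior probability
    branching: the branching factor Q, so that U = 1/Q
    """
    uniform = 1.0 / branching
    intensity = sum(math.log(p_on[y] / p_off[y]) for y in responses)
    geometry = sum(math.log(p_geom[m] / uniform) for m in moves)
    return intensity + geometry


# Toy example (hypothetical numbers): binary edge response, three branches,
# uniform geometric prior, so the geometric reward vanishes.
p_on = {0: 0.2, 1: 0.8}
p_off = {0: 0.8, 1: 0.2}
p_geom = {-1: 1 / 3, 0: 1 / 3, 1: 1 / 3}
reward = path_reward([1, 1, 0], [0, 1], p_on, p_off, p_geom, branching=3)
```

With a uniform geometric prior every geometric term is $\log 1 = 0$, which is the sense in which the $U$ term only shifts the log posterior by a constant.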
It is important to emphasize that our results can be extended to higher-order
Markov chain models (provided they are shift-invariant). We can, for example,
define the x variable to represent spatial orientation and position of a small 
edge segment. This will allow our theory to apply to models, such as snakes, 
used in recent successful vision applications [2], [12]. (It is straightforward to 
transform the standard energy function formulation of snakes into a Markov 
chain by discretizing and replacing the derivatives by differences. The smoothness 
constraints, such as membranes and thin plate terms, will transform into first 
and second order Markov chain connections respectively). Recent work by Zhu 
[33] shows that Markov chain models of this type can be learnt using Minimax 
Entropy Learning theory from a representative set of examples. Indeed Zhu goes 
further by demonstrating that other Gestalt grouping laws can be expressed in 
this framework and learnt from representative data. 

Reward functions, such as equation (2), are ideally suited to A* graph/tree
search algorithms [21], [23] and we will therefore analyze A* algorithms later in
this paper, see section (4). As we will describe, A* searches the nodes - possible
branches of the road/snake - which are most promising. The “goodness” $f(n)$ of
a node $n$ is $g(n) + h(n)$, where $g(n)$ is the reward to get to the node and $h(n)$ is a
heuristic reward to get to the finish from $n$. The A* algorithm starts at the top
of the tree and evaluates the child nodes (i.e. those connected to the top node
by a single arc). These child nodes are placed in the queue. As the algorithm
proceeds it selects the member of the queue with the best evaluation, removes it
from the queue, expands its children and enters them in the queue.

The evaluation of a node is based on the reward to reach it from the
top node (i.e. the sum of the log posteriors) and on a heuristic reward based
on anticipated future performance. More precisely, a path segment ending at
$x$ has a total reward $f(x) = g(x) + h(x)$ (note that the nonoverlapping path
requirement implies that $x$ determines a unique path to the initialization point).
The choice of heuristic reward $h(x)$ is very important to the algorithm [21]. It
can be proven that if $h(x)$ is an upper bound on the reward to get to the end
of the path then A* is guaranteed to find the global maximum eventually. An
A* algorithm whose heuristic satisfies this bound is called admissible. One that
does not is called inadmissible. The problem is that admissible A* algorithms are
guaranteed to find the best result but may do so slowly. By contrast, inadmissible
A* algorithms are often faster but may fail in certain cases.
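The queue-based procedure described above can be sketched as a generic reward-maximizing best-first search. This is a minimal illustration, not the algorithm analyzed in the paper: the function `a_star_max`, its callback arguments and the toy binary tree are hypothetical, and nodes are represented as tuples of moves so that each node determines a unique path from the root.

```python
import heapq
import itertools


def a_star_max(root, children, arc_reward, heuristic, is_goal):
    """Best-first search that maximizes f(n) = g(n) + h(n) on a tree.

    heapq is a min-heap, so -f is stored. Returns the goal node reached,
    its accumulated reward g, and the number of nodes expanded.
    """
    counter = itertools.count()  # tie-breaker so nodes are never compared
    queue = [(-heuristic(root), next(counter), 0.0, root)]
    expanded = 0
    while queue:
        _, _, g, node = heapq.heappop(queue)
        if is_goal(node):
            return node, g, expanded
        expanded += 1
        for child in children(node):
            g_child = g + arc_reward(node, child)
            f_child = g_child + heuristic(child)
            heapq.heappush(queue, (-f_child, next(counter), g_child, child))
    return None, float("-inf"), expanded


# Toy binary tree of depth 3: move 0 earns reward 1, move 1 earns 0.
DEPTH = 3
best, g_best, n_expanded = a_star_max(
    root=(),
    children=lambda n: [n + (0,), n + (1,)] if len(n) < DEPTH else [],
    arc_reward=lambda parent, child: 1.0 if child[-1] == 0 else 0.0,
    heuristic=lambda n: float(DEPTH - len(n)),  # upper bound, hence admissible
    is_goal=lambda n: len(n) == DEPTH,
)
```

In this toy case the heuristic is an upper bound on the remaining reward, so the search is admissible; replacing it with a smaller value per remaining arc would give an inadmissible variant that may expand fewer nodes but is no longer guaranteed to return the best path.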

Karp and Pearl [21] provided a theoretical analysis of convergence rates of 
A* search. They studied a binary tree where the rewards for each arc were 0 or 
1 and were specified by a probability p. They then studied the task of finding 
the minimum reward path. This is an interesting task but it differs from ours in 
many respects. From our perspective, it resembles the task of finding the best 
path in the noise/clutter rather than detecting a true target in the presence of 
noise/clutter. 

There are three elements to our proofs. The first is the use of Sanov’s theorem
to put exponential bounds on the probabilities of rare events - this theorem is
described in section (3). The second is an onion peeling strategy to recursively
explore the search tree, which is described further in [26]. The third is the summation
of exponential series, generated by Sanov’s theorem, which is described in more
detail in [6].

3 Sanov’s Theorem 

This section introduces results from the theory of types [7] which we will use
to prove our results. We will be particularly concerned with Sanov’s theorem,
which we state without proof later in this section. To motivate this material we will
apply it to the problem of determining whether a given set of measurements is
more likely to come from a road or non-road, but without making any
geometrical assumptions about the likely shape of the road. The theorem assumes that
we have an underlying distribution $Q$ which generates a set of $N$ independent
identically distributed (i.i.d.) samples. From each sample set we can determine
an empirical normalized histogram, or type, see figure (3). (This normalization
ensures that the components of each type sum to one and hence can be
interpreted as an empirical distribution.) The law of large numbers states that these
empirical histograms (when normalized) must become similar to the distribution
$Q$ as $N \to \infty$. Sanov’s theorem puts bounds on how fast the empirical histograms
converge (in probability) to the underlying distribution. Thereby it puts bounds
on the probability of rare events.




Fig. 3. Samples from an underlying distribution. Left to right, the original dis- 
tribution, followed by histograms, or types, from 10, 100, and 1000 samples from 
the original. Observe that for small numbers of samples the types tend to differ 
greatly from the true distribution. But for large N the law of large numbers says 
that they must converge (with high probability). 



More precisely, Sanov’s Theorem states:

Sanov’s Theorem. Let $y_1, y_2, \ldots, y_N$ be i.i.d. from a distribution $Q(y)$ with
alphabet size $J$ and $E$ be any closed set of probability distributions. Let $\Pr(\phi \in E)$
be the probability that the type $\phi$ of a sample sequence lies in the set $E$. Then:

$$\frac{2^{-N D(\phi^*\|Q)}}{(N+1)^J} \leq \Pr(\phi \in E) \leq (N+1)^J \, 2^{-N D(\phi^*\|Q)}, \qquad (3)$$

where $\phi^* = \arg\min_{\phi \in E} D(\phi\|Q)$ is the distribution in $E$ that is closest to $Q$ in terms
of Kullback-Leibler divergence, given by $D(\phi\|Q) = \sum_{y=1}^{J} \phi(y) \log(\phi(y)/Q(y))$.
Fig. 4. Left, Sanov’s theorem. The triangle represents the set of probability
distributions. $Q$ is the distribution which generates the samples. Sanov’s theorem
states that the probability that a type, or empirical distribution, lies within the
subset $E$ is chiefly determined by the distribution in $E$ which is closest to $Q$
(in the sense of Kullback-Leibler). Right, Sanov’s theorem for the coin tossing
experiment. The set of probabilities is one-dimensional and is labelled by the
probability $P(head)$ of tossing a head. The unbiased distribution $Q$ is at the
centre, with $P(head) = 1/2$, and the closest element of the set $E$ is $P^*$ such that
$P^*(head) = 0.7$.

This is illustrated by figure (4). Intuitively, it shows that, when considering
the chance of a set of rare events happening, we essentially only have to worry
about the “most likely” of the rare events (in the sense of Kullback-Leibler
divergence). Most importantly, it tells us that the probability of rare events falls
off exponentially with the Kullback-Leibler divergence between the rare event (its
type) and the true distribution. This exponential fall-off is critical for proving
the results in this paper. Note that Sanov’s theorem involves an alphabet factor
$(N+1)^J$. This alphabet factor becomes irrelevant at large $N$ (compared to the
exponential term). It does, however, require that the distribution $Q$ is defined
on a finite space, or can be well approximated by a quantized distribution on a
finite space.

Sanov’s theorem can be illustrated by a simple coin tossing example, see
figure (4). Suppose we have a fair coin and want to estimate the probability of
observing more than 700 heads in 1000 tosses. Then set $E$ is the set of probability
distributions for which $P(head) \geq 0.7$ ($P(head) + P(tails) = 1$). The distribution
generating the samples is $Q(head) = Q(tails) = 1/2$ because the coin is fair. The
distribution in $E$ closest to $Q$ is $P^*(head) = 0.7$, $P^*(tails) = 0.3$. We calculate
$D(P^*\|Q) = 0.119$. Substituting into Sanov’s theorem, setting the alphabet size
$J = 2$, we calculate that the probability of more than 700 heads in 1000 tosses
is less than $2^{-119} \times (1001)^2 \leq 2^{-99}$.
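The coin tossing numbers can be checked with a short Python sketch. The helper names `kl_bits` and `sanov_log2_upper` are hypothetical, and logarithms are taken to base 2 to match the powers of 2 in the bound. The exact binomial tail for at least 700 heads (matching the closed set $P(head) \geq 0.7$) is included for comparison.

```python
import math


def kl_bits(p, q):
    """Kullback-Leibler divergence D(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sanov_log2_upper(n, alphabet, divergence):
    """log2 of the Sanov upper bound (N+1)^J * 2^(-N D)."""
    return alphabet * math.log2(n + 1) - n * divergence


N, J = 1000, 2
d = kl_bits([0.7, 0.3], [0.5, 0.5])          # rounds to the 0.119 quoted above
log2_bound = sanov_log2_upper(N, J, d)        # unrounded exponent of the bound

# Exact probability of at least 700 heads in 1000 fair tosses, as a log2.
tail_count = sum(math.comb(N, k) for k in range(700, N + 1))
log2_exact = math.log2(tail_count) - N
```

Using the unrounded divergence, the exponent of the bound comes out slightly above the rounded figure quoted in the text, while the exact tail probability lies well below the bound, as the theorem requires.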

In this paper, we will only be concerned with sets $E$ which involve the
rewards of types. These sets will therefore be defined by linear constraints
on the types - in particular, constraints such as $\phi \cdot \alpha > T$, where
$\alpha(y) = \log(P_{on}(y)/P_{off}(y))$, $y = 1, \ldots, J$. (We define $\phi \cdot \alpha = \sum_{y=1}^{J} \phi(y)\alpha(y)$.) This will
enable us to derive results which will not be true for arbitrary sets $E$. We will
often, however, be concerned with the probabilities that the rewards of samples
from one distribution are greater than those from a second. It is straightforward
to generalize Sanov’s theorem to deal with such cases.

This leads to:

Theorem 1. The probability that a sequence of samples from on-road has
lower intensity reward than a sequence of samples from off-road is bounded below
by $(N+1)^{-2J} 2^{-2N B(P_{on},P_{off})}$ and above by $(N+1)^{2J} 2^{-2N B(P_{on},P_{off})}$, where
$B(P_{on}, P_{off}) = -\log\{\sum_{y=1}^{J} P_{off}^{1/2}(y) P_{on}^{1/2}(y)\}$. ($N$ is the number of elements in
each sequence of samples.)

Proof. This is a generalization of Sanov’s theorem to the case where we have
two probability distributions and two types. We define
$E = \{(\phi^{off}, \phi^{on}) : \phi^{off} \cdot \alpha \geq \phi^{on} \cdot \alpha\}$. We then apply the same strategy as for the Sanov proof but
applied to the product space of the two distributions $P_{on}, P_{off}$. This requires us
to minimize:



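$$f(\phi^{off}, \phi^{on}) = N D(\phi^{off}\|P_{off}) + N D(\phi^{on}\|P_{on}) + \tau_1\Big\{\sum_{y=1}^{J} \phi^{off}(y) - 1\Big\} + \tau_2\Big\{\sum_{y=1}^{J} \phi^{on}(y) - 1\Big\} + \gamma\{\phi^{on} \cdot \alpha - \phi^{off} \cdot \alpha\}, \qquad (4)$$

where the $\tau$’s and $\gamma$ are Lagrange multipliers. The function $f(\cdot,\cdot)$ is convex in the
$\phi^{off}, \phi^{on}$ and the Lagrange constraints are linear. Therefore there is a unique
minimum which occurs at:

$$\phi^{off*}(y) = \frac{P_{on}^{\gamma}(y) P_{off}^{1-\gamma}(y)}{Z[\gamma]}, \qquad \phi^{on*}(y) = \frac{P_{on}^{1-\gamma}(y) P_{off}^{\gamma}(y)}{Z[1-\gamma]},$$

subject to the constraint $\phi^{off*} \cdot \alpha = \phi^{on*} \cdot \alpha$, where $Z[\cdot]$ denotes the normalization
constant. The unique solution occurs when $\gamma = 1/2$ (because this implies
$\phi^{off*} = \phi^{on*}$ and so the constraints are satisfied). We
define $\phi_{Bh} = P_{on}^{1/2} P_{off}^{1/2} / Z[1/2]$ (Bh is short for Bhattacharyya).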

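We therefore obtain:

$$(N+1)^{-2J} 2^{-N\{D(\phi_{Bh}\|P_{off}) + D(\phi_{Bh}\|P_{on})\}} \leq \Pr\{(\phi^{off}, \phi^{on}) \in E\} \leq (N+1)^{2J} 2^{-N\{D(\phi_{Bh}\|P_{off}) + D(\phi_{Bh}\|P_{on})\}}.$$

We define $B(P_{on}, P_{off}) = (1/2)\{D(\phi_{Bh}\|P_{off}) + D(\phi_{Bh}\|P_{on})\}$. Substituting
in for $\phi_{Bh}$ from above yields $B(P_{on}, P_{off}) = -\log\{\sum_{y=1}^{J} P_{off}^{1/2}(y) P_{on}^{1/2}(y)\}$.
Hence result.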

This result tells us that the probability that an off-road sequence has higher
reward than an on-road sequence falls off as $2^{-2NB(P_{on},P_{off})}$ for large
$N$. We define the fall-off factor (i.e. the negative of the coefficient of $N$ in
the exponent) to be the order parameter. For this task, the order parameter is
therefore $2B(P_{on},P_{off})$ which is always positive (except for the degenerate
case $P_{on} = P_{off}$ for which it becomes zero). $B(P_{on},P_{off})$ is just a
measure of the distance between $P_{on}$ and $P_{off}$ and we will refer to it as
the Bhattacharyya distance (because it is identical to the Bhattacharyya bound for
Bayes error, see [22]). For this task, unlike tree search (see next section), there
is no critical point and no phase transition.
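The identity used at the end of the proof, $B = (1/2)\{D(\phi_{Bh}\|P_{off}) + D(\phi_{Bh}\|P_{on})\} = -\log \sum_y P_{off}^{1/2}(y) P_{on}^{1/2}(y)$, can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (consistent with the $2^{-2NB}$ form of the bound) and two made-up three-bin distributions; the function names are illustrative and not from the paper.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p||q) in bits (base-2 log assumed)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bhattacharyya(p_on, p_off):
    """B(P_on, P_off) = -log2 sum_y sqrt(P_on(y) * P_off(y))."""
    return -math.log2(sum(math.sqrt(a * b) for a, b in zip(p_on, p_off)))

def phi_bh(p_on, p_off):
    """Normalized geometric mean of the two distributions (gamma = 1/2)."""
    w = [math.sqrt(a * b) for a, b in zip(p_on, p_off)]
    z = sum(w)
    return [x / z for x in w]

# Hypothetical on-road / off-road edge-response distributions.
p_on = [0.1, 0.2, 0.7]
p_off = [0.6, 0.3, 0.1]

phi = phi_bh(p_on, p_off)
B = bhattacharyya(p_on, p_off)
half_sum = 0.5 * (kl(phi, p_off) + kl(phi, p_on))
order_parameter = 2 * B  # fall-off rate of the error probability with N
```

Here `B` and `half_sum` agree up to rounding, and `order_parameter` shrinks to zero as the two distributions approach each other.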

4 Tree Search: A* and Inadmissible Heuristics. 

Our main result of this section is to prove convergence of A* algorithms with
inadmissible heuristics. We prove that convergence is achieved with $O(N)$
expected nodes opened and we put bounds on the expected errors of the solutions.
It can also be proved that the expected search time per node is constant (i.e.
independent of $N$) [6].



4.1 A* Convergence for the Bhattacharyya Heuristic 

We now want to consider a traditional A* search strategy using a heuristic 
function but no pruning. In this section, we will formulate the problem for any 
heuristic and then obtain bounds for a special case, which we call the Bhat- 
tacharyya heuristic (again because it is directly related to the Bhattacharyya 
bound). In the following section, we will generalize our results to other
inadmissible heuristics.

For a node $W_M$, at distance $M$ from the start, we let $g(W_M)$ be the
measured reward and $h(W_M)$ be the heuristic function. The A* algorithm proceeds
by searching the node in the queue for which the combined reward
$f(W_M) = g(W_M) + h(W_M)$ is greatest. How many nodes (or arcs) do we expect to
search by this strategy? And what are the expected errors in our solutions?

The reward to reach $W_M$ is just the reward of the log-likelihood data and
prior terms along the path from the start to $W_M$. We define the heuristic reward
$h(W_M) = (N - M)(H_L + H_P)$ where $H_L$ and $H_P$ are constants ($H_L$ and $H_P$
are heuristics for the likelihood and the prior respectively). As we have shown
[6], convergence results can be proven for a range of values of $H_L$ and $H_P$. In
this paper, however, we will only consider special values $H_L, H_P$ for which the
analysis simplifies.
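The search rule just described can be sketched as a best-first search that always expands the queued node with the largest $g + h$, with $h = (N - M)(H_L + H_P)$. This is only an illustrative sketch: the tree is represented implicitly, `arc_reward` and `toy_arc_reward` are hypothetical stand-ins for the measured log-likelihood and prior rewards, and `h_per_arc` plays the role of $H_L + H_P$.

```python
import heapq

def a_star_max_reward(n_depth, branching, arc_reward, h_per_arc):
    """Best-first search maximizing f = g + h with h = (N - M) * h_per_arc.

    arc_reward(path, k) returns the measured reward for extending the tuple
    `path` by child index k. Returns (path, reward, nodes_expanded) for the
    first depth-N node taken from the queue.
    """
    # heapq is a min-heap, so the combined reward is stored negated.
    queue = [(-(n_depth * h_per_arc), 0.0, ())]
    expanded = 0
    while queue:
        _, g, path = heapq.heappop(queue)
        if len(path) == n_depth:
            return path, g, expanded
        expanded += 1
        for k in range(branching):
            g_child = g + arc_reward(path, k)
            child = path + (k,)
            h = (n_depth - len(child)) * h_per_arc
            heapq.heappush(queue, (-(g_child + h), g_child, child))
    return None, float("-inf"), expanded

def toy_arc_reward(path, k):
    """Toy rewards: +1 along the target path (all zeros), -1 elsewhere."""
    on_target = k == 0 and all(c == 0 for c in path)
    return 1.0 if on_target else -1.0

# h_per_arc = 1.0 is an upper bound on the per-arc reward (admissible);
# h_per_arc = 0.0 underestimates future reward (inadmissible when maximizing).
result_admissible = a_star_max_reward(6, 3, toy_arc_reward, 1.0)
result_inadmissible = a_star_max_reward(6, 3, toy_arc_reward, 0.0)
```

In this noise-free toy case both settings follow the target path directly; with an inadmissible heuristic the first complete path found is in general not guaranteed to be the optimal one, which is why the analysis bounds the expected error.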

Observe that a path segment will be visited only if the reward to get to it 
(including its heuristic reward) is sufficiently high. More precisely, if a segment 
n of a false path is searched then this implies that its reward is better than the 
reward of at least one point on the target path. This is because the A* algorithm 
always maintains a queue of nodes to explore and searches the node segment 
with highest reward. The algorithm is initialized at the start of the target path 
and so an element of the target path will always lie in the queue of nodes that 
A* considers searching. Hence a node will never be explored if its reward is lower 
than all the rewards on the target path segments. 

Since the length of all possible paths is constant we can ignore the constant
factor $N(H_L + H_P)$ and the heuristic will then merely penalize path segments
which have been tested. Then a false path of length $n$ and a true path of length
$m$ will have effective rewards denoted by the random variables $S_{off}(n)$ and
$S_{on}(m)$:

$$S_{off}(n) = \sum_{i=1}^{n} \Big\{\log\frac{P_{on}(y_{x_i})}{P_{off}(y_{x_i})} - H_L\Big\}_{off}
  + \sum_{i=1}^{n} \Big\{\log\frac{P_{\Delta G}(x_{i+1} - x_i)}{U(x_{i+1} - x_i)} - H_P\Big\}_{off},$$

$$S_{on}(m) = \sum_{i=1}^{m} \Big\{\log\frac{P_{on}(y_{x_i})}{P_{off}(y_{x_i})} - H_L\Big\}_{on}
  + \sum_{i=1}^{m} \Big\{\log\frac{P_{\Delta G}(x_{i+1} - x_i)}{U(x_{i+1} - x_i)} - H_P\Big\}_{on},$$

where the subscripts off and on are used to denote false and true paths
respectively (paths with a mixture of true and false segments will be dealt with
later).

We now define types $\phi^{off}, \psi^{off}$ and $\phi^{on}, \psi^{on}$ for false
and true road samples, with $\phi$ corresponding to the data and $\psi$ to the
prior. These types are normalized so that their components sum to 1, i.e.
$\sum_{y=1}^{J} \phi(y) = 1$, $\sum_{\mu=1}^{Q} \psi(\mu) = 1$. The types
will be computed for samples of variable lengths $n, m$. These lengths will be
clear from the context so we will not label them explicitly (i.e. we will not use
notation like $\phi_n$ to denote types taken from $n$ samples).

Therefore we express the rewards of two sequences $S_{off}(n)$ and $S_{on}(m)$ by:

$$S_{off}(n) = n\{\phi^{off} \cdot \alpha - H_L\} + n\{\psi^{off} \cdot \beta - H_P\},$$
$$S_{on}(m) = m\{\phi^{on} \cdot \alpha - H_L\} + m\{\psi^{on} \cdot \beta - H_P\}, \qquad (8)$$

where $\alpha(y) = \log(P_{on}(y)/P_{off}(y))$ and
$\beta(\delta x) = \log(P_{\Delta G}(\delta x)/U(\delta x))$.
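To make the step from sums over samples to types concrete, the sketch below computes the data part of a reward both ways and confirms they agree. It is an illustration only: the distributions, the sample sequence, and the value of `H_L` are invented, base-2 logarithms are assumed, and the prior term with $\psi$ and $\beta$ would be handled in the same way.

```python
import math
from collections import Counter

def empirical_type(samples, alphabet_size):
    """Normalized histogram (type) of a sequence over {0, ..., J-1}."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(y, 0) / n for y in range(alphabet_size)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical distributions over a three-symbol alphabet (J = 3).
p_on = [0.1, 0.2, 0.7]
p_off = [0.6, 0.3, 0.1]
alpha = [math.log2(a / b) for a, b in zip(p_on, p_off)]
H_L = 0.25  # illustrative heuristic constant, not a value from the paper

samples = [2, 2, 0, 1, 2, 0]  # quantized filter responses along a path
n = len(samples)

# Reward as a sum over the individual samples.
direct = sum(alpha[y] - H_L for y in samples)

# The same reward written in terms of the type of the sequence.
phi = empirical_type(samples, 3)
via_type = n * (dot(phi, alpha) - H_L)
```

Because the reward depends on the sequence only through its type, bounds on the probability of a set of types (Sanov's theorem) translate directly into bounds on the probability of a reward event.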

Recall that if a segment $n$ of a false path is searched then its reward must
be better than the reward of at least one subpath along the target path. This
means that we should consider $\Pr\{\exists m : S_{off}(n) > S_{on}(m)\}$. This,
however, is hard to compute so we bound it above by
$\sum_{m=0}^{\infty} \Pr\{S_{off}(n) > S_{on}(m)\}$ (using Boole's inequality).

Our first result is Theorem 2, which is proven using Sanov’s theorem (in- 
cluding the use of constrained optimization to find the fall-off coefficients) and 
results for the sums of exponential series. The main point of this result is to 
show that the chance of an off-road path having greater reward than any true 
road path falls off exponentially with the length of the off-road path. 

We first define two sub-order parameters
$\Psi_1 = D(\phi_{Bh}\|P_{off}) + D(\psi_{Bh}\|U)$
and $\Psi_2 = D(\phi_{Bh}\|P_{on}) + D(\psi_{Bh}\|P_{\Delta G})$ ($\psi_{Bh}$ is
defined analogously to $\phi_{Bh}$, see Theorem 1). These parameters will determine
the convergence and error rates of the algorithm by means of the two functions:



1 Q<S>WS>eO 

Cl{P) = 5 C2{P) = { _ g(g)fi^^(8)eO)2 T “(^5^)}: (9) 

where H, S are constants (independent of N) whose exact forms are given in [6]. 

It will be shown, in Theorem 3, that the order parameter for this problem is
$K = \Psi_1 + \Psi_2 - \log Q$. This can be re-expressed as
$2B(P_{on},P_{off}) + 2B(P_{\Delta G},U) - \log Q$ (observe the similarity with the
result of Theorem 1). Note that this result depends on the search algorithm being
A*. However, it has been shown that an identical order parameter is obtained [27]
when analyzing whether the target can be detected by any algorithm which computes
the MAP estimate.

Theorem 2. The A* algorithm, using the Bhattacharyya heuristic
$H_L = \phi_{Bh} \cdot \alpha$ and $H_P = \psi_{Bh} \cdot \beta$, gives:

$$\Pr\{S_{off}(n) > S_{on}(m)\} \le \{(n+1)(m+1)\}^{2J+2Q}\, 2^{-(n\Psi_1 + m\Psi_2)}. \qquad (10)$$

Moreover, the probability of a particular false path segment being searched falls
off, to first order in $n$, as $2^{-n\Psi_1}$, where $n$ is the number of segments
by which this path segment diverges from the target path.

Proof. The first part of the proof is again a generalization of Sanov applied
to product distributions, see Theorem 1. The new twist is that we have different
length factors $n$ and $m$ and the heuristics (also we consider distributions on a
four-dimensional product space instead of two dimensions). But for the
Bhattacharyya heuristics this will make no difference. (More general heuristics are
dealt with in [6].) Define:

$$E = \{(\phi^{off}, \psi^{off}, \phi^{on}, \psi^{on}) :
  n\{\phi^{off} \cdot \alpha - H_L + \psi^{off} \cdot \beta - H_P\}
  > m\{\phi^{on} \cdot \alpha - H_L + \psi^{on} \cdot \beta - H_P\}\}. \qquad (11)$$

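Applying the strategy from Theorem 1, we must minimize:

$$f(\phi^{off}, \psi^{off}, \phi^{on}, \psi^{on}) = n D(\phi^{off}\|P_{off}) + n D(\psi^{off}\|U)
  + m D(\phi^{on}\|P_{on}) + m D(\psi^{on}\|P_{\Delta G})$$
$$+ \tau_1\{\sum_{y} \phi^{off}(y) - 1\} + \tau_2\{\sum_{y} \phi^{on}(y) - 1\}
  + \tau_3\{\sum_{\mu} \psi^{off}(\mu) - 1\} + \tau_4\{\sum_{\mu} \psi^{on}(\mu) - 1\}$$
$$+ \gamma\{m\{\phi^{on} \cdot \alpha - H_L + \psi^{on} \cdot \beta - H_P\}
  - n\{\phi^{off} \cdot \alpha - H_L + \psi^{off} \cdot \beta - H_P\}\}, \qquad (12)$$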



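where the $\tau$'s and $\gamma$ are Lagrange multipliers. As before, we know that
this function is convex so there is a unique minimum. Observe that $f(.,.,.,.)$
consists of four terms of form
$n D(\phi^{off}\|P_{off}) + \tau_1\{\sum_y \phi^{off}(y) - 1\} - n\gamma\, \phi^{off} \cdot \alpha$,
which are coupled by shared constants. These terms can be minimized separately to
give:

$$\phi^{off}_{\gamma} = \frac{P_{on}^{\gamma} P_{off}^{1-\gamma}}{Z[1-\gamma]}, \quad
  \phi^{on}_{\gamma} = \frac{P_{on}^{1-\gamma} P_{off}^{\gamma}}{Z[\gamma]}, \quad
  \psi^{off}_{\gamma} = \frac{P_{\Delta G}^{\gamma} U^{1-\gamma}}{Z_2[1-\gamma]}, \quad
  \psi^{on}_{\gamma} = \frac{P_{\Delta G}^{1-\gamma} U^{\gamma}}{Z_2[\gamma]}, \qquad (13)$$

subject to the constraint given by equation (11).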

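As before, we see that the unique solution occurs when $\gamma = 1/2$. In this case:

$$\phi^{off} \cdot \alpha = H_L = \phi^{on} \cdot \alpha, \qquad
  \psi^{off} \cdot \beta = H_P = \psi^{on} \cdot \beta. \qquad (14)$$

The solution occurs at $\phi_{Bh}, \psi_{Bh}$
($= \phi_{\gamma=1/2}, \psi_{\gamma=1/2}$). Hence the first result.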

We must now sum over $m$ to obtain the bound for
$\Pr\{\exists m : S_{off}(n) > S_{on}(m)\}$. For large $m$, the alphabet terms are
unimportant and we just need to sum the geometric series. However, we must add
extra terms (functions of $\epsilon$ and $\Psi_2$) to correct for the alphabet
factors for small $m$, see [6]. Hence

$$\Pr\{\exists m : S_{off}(n) > S_{on}(m)\} \le (n+1)^{2J+2Q}\, C_1(\Psi_2)\, 2^{-n\Psi_1}. \qquad (15)$$



We can now state our main result about the convergence of A* using the
Bhattacharyya heuristic. Our result, Theorem 3, builds on Theorem 2 by adding
an onion peeling argument combined with the summation of exponential series.
The key concept here is the onion-like structure of the tree representation, see
figure (5). This structure allows us to classify all paths in terms of sets
$F_1, F_2, F_3, \ldots$ which depend on where they branch off from the true path.
Paths which are always bad (i.e. completely false) correspond to $F_1$. Paths which
are good for one segment, and then go bad, form $F_2$ and so on. By peeling off
segments it can be shown that results for $F_1$ can be readily adapted to
$F_2, F_3, \ldots$.




Fig. 5. Left: We can divide the set of paths up into $N$ subsets $F_1, \ldots, F_N$
as shown here. Paths in $F_1$ are completely off-road. Paths in $F_2$ have one
on-road segment and so on. Intuitively, we can think of this as an onion where we
peel off paths stage by stage. Right: When paths leave the true path they make
errors which we characterize by the number of false arcs. For example, a path in
$F_1$ has error $N$, a path in $F_i$ has error $N + 1 - i$.



Theorem 3. Provided $\Psi_1 > \log Q$, the expected number of searches is $O(N)$
in the size of the problem and is bounded above by
$C_1(\Psi_2)\, C_1(\Psi_1 - \log Q)\, N$. Moreover, the expected error in convergence
is bounded above by $C_1(\Psi_2)\, C_2(\Psi_1 - \log Q)$, which is small, independent
of the size $N$ of the problem, and decays exponentially with $\Psi_1 - \log Q$. The
order parameter $K = \Psi_1 + \Psi_2 - \log Q$.

Proof. We use the onion peeling strategy to express the expectation in terms
of the expected number of nodes searched in $F_1, F_2, F_3, \ldots, F_N$. By the
structure of our problem the expectations will be bounded by the same number for
all $F_i$. Therefore the bound is linear provided the expectation for $F_1$ is
finite. More precisely, we get a bound expressed in terms of $|F_i|$, the
cardinality of $F_i$.

Theorem 2 bounds the probability that a specific path of length $n$ in $F_1$ will
have higher reward than any subpath of the true path (a subpath must start at the
beginning of the target path). We determine that the expected number of paths
in $F_1$ of length $n$, with rewards higher than any subpath of the target path, is
bounded above by $C_1(\Psi_2)\,(n+1)^{2J+2Q}\, Q^n\, 2^{-n\Psi_1}$, see equation (15),
where $C_1(\Psi)$ is specified by equation (9). This can be summed over $n$, again
taking care with the alphabet factors, see [6], to obtain
$C_1(\Psi_1 - \log Q)\, C_1(\Psi_2)$. This is finite provided $\Psi_1 > \log Q$. Our
first result follows.

To put bounds on the expected errors of the algorithm we measure the error in
terms of the expected number of off-road segments in the best path found by A*.
We use the onion peeling strategy again and consider the probability $\Pr(n)$ that
A* will explore a path in $F_{N+1-n}$ to the end, for any $n$, instead of proceeding
along the true path. If this happens we will get an error of size $n$. The expected
error can then be bounded above by $\sum_{n=1}^{N} \Pr(n)\, n$.

We want to put an upper bound on $\Pr(n)$. Observe that a path in $F_{N+1-n}$ will
be followed to the end only if its reward is greater than the heuristic reward along
the true path, or the reward of one arc of the true path plus the heuristic reward
for the remainder, or the reward for two true arcs plus the heuristic reward for the
rest, and so on. We can apply Sanov to get probability bounds for these by using
the constraints
$n\{\phi^{off} \cdot \alpha + \psi^{off} \cdot \beta\} > m\{\phi^{on} \cdot \alpha + \psi^{on} \cdot \beta\} + (n - m)\{H_L + H_P\}$,
where $m = 1, \ldots, n$ is the number of arcs of the true path that are explored.
These constraints, of course, are the same constraints
$n\{\phi^{off} \cdot \alpha + \psi^{off} \cdot \beta - H_L - H_P\} > m\{\phi^{on} \cdot \alpha + \psi^{on} \cdot \beta - H_L - H_P\}$
which we used in Theorem 2 above. Therefore, by Boole's inequality,

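$$\Pr(n) \le Q^n \sum_{m=0}^{\infty} \{(n+1)(m+1)\}^{2J+2Q} \times 2^{-(n\Psi_1 + m\Psi_2)}. \qquad (16)$$

As before, we can sum the series with respect to $m$, see [6], to obtain:

$$\Pr(n) \le C_1(\Psi_2)\,(n+1)^{2J+2Q}\, 2^{-n(\Psi_1 - \log Q)}. \qquad (17)$$

The expected error is then bounded above by $\sum_{n} n \Pr(n)$, whose dominant,
exponential terms can be summed (see [6]), yielding:

$$\langle \mathrm{Error} \rangle \le C_2(\Psi_1 - \log Q)\, C_1(\Psi_2). \qquad (18)$$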



5 Conclusion 

Our analysis shows it is possible to track certain classes of image contours with 
linear expected node expansions (linear expected sorting time per node is shown 
in [6]). We have shown how the convergence rates, and the choice of A* heuris- 
tics, depend on order parameters which characterize the problem domain. In 
particular, the entropy of the geometric prior and the Kullback-Leibler distance
between $P_{on}$ and $P_{off}$ allow us to quantify intuitions about the power of
geometrical assumptions and edge detectors to solve these tasks. Not surprisingly,
the easiest target curves to detect are those for which the edge detector is most
informative and the prior geometric knowledge most constraining. Our analysis
allows us to quantify these intuitions. See [19] for analysis of the forms of
$P_{on}, P_{off}$ arising in typical images.

Our more recent work [27] has extended this work by showing that similar
order parameters can be used to specify intrinsic (algorithm independent)
difficulty of the search problem and that phase transitions occur when these order
parameters take critical values. Fortunately, the proofs in this paper break down
at closely related critical points. Therefore A* algorithms are an effective way
to solve this problem in the regime for which it can be solved.

As shown in [25] many of the search algorithms proposed to solve vision 
search problems [20], [2], [12] are special cases of A* (or close approximations). 
We therefore hope that the results of this paper will throw light on the success 
of the algorithms and may suggest practical improvements and speed ups, see 
[5] for promising preliminary results. 

Crucial to our analysis has been the use of Bayesian probability theory both 
to determine an optimization criterion for the problem we wish to solve and 
to define the Bayesian ensemble of problem instances. Analysis of the Bayesian 
ensemble led to the definition of order parameters which characterized the dif- 
ficulty of the problem. It will be interesting to compare our results with those 
obtained by [4] , [24] for completely different classes of problems and using differ- 
ent techniques. This is a topic for further research. 
Abstract. Many tasks in computer vision involve assigning a label (such
as disparity) to every pixel. These tasks can be formulated as energy
minimization problems. In this paper, we consider a natural class of energy
functions that permits discontinuities. Computing the exact minimum
is NP-hard. We have developed a new approximation algorithm based
on graph cuts. The solution it generates is guaranteed to be within a
factor of 2 of the energy function's global minimum. Our method
produces a local minimum with respect to a certain move space. In this
move space, a single move is allowed to switch an arbitrary subset of
pixels to one common label. If this common label is $\alpha$ then such a move
expands the domain of $\alpha$ in the image. At each iteration our algorithm
efficiently chooses the expansion move that gives the largest decrease in
the energy. We apply our method to the stereo matching problem, and
obtain promising experimental results. Empirically, the new technique
outperforms our previous algorithm [6] both in terms of running time
and output quality.



1 Energy Minimization in Early Vision 

Many early vision problems require estimating some spatially varying quantity
(such as intensity, disparity or texture) from noisy measurements. Such
quantities tend to be piecewise smooth; they vary smoothly at most points, but
change dramatically at object boundaries. Every pixel $p \in \mathcal{P}$ must be
assigned a label in some set $\mathcal{L}$; for motion or stereo, the labels are
disparities, while for image restoration they represent intensities. The goal is to
find a labeling $f$ that assigns each pixel $p \in \mathcal{P}$ a label
$f_p \in \mathcal{L}$, where $f$ is both piecewise smooth and consistent
with the observed data.

These vision problems can be naturally formulated in terms of energy
minimization. In this framework, one seeks the labeling $f$ that minimizes the
energy

$$E(f) = E_{smooth}(f) + E_{data}(f).$$

Here $E_{smooth}$ measures the extent to which $f$ is not piecewise smooth, while
$E_{data}$ measures the disagreement between $f$ and the observed data. Many
different energy functions have been proposed in the literature, depending upon the
exact vision problem. The form of $E_{data}$ is typically
$E_{data}(f) = \sum_{p \in \mathcal{P}} D_p(f_p)$, where $D_p$ measures how
appropriate a label is for the pixel $p$ given the observed data. In the image
restoration problem, for example, usually $D_p(f_p) = (f_p - i_p)^2$, where
$i_p$ is the observed intensity of the pixel $p$.

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 205-220, 1999.

The choice of $E_{smooth}$ is a critical issue, and many different functions have
been proposed. For example, in regularization-based vision [2,12,15], $E_{smooth}$
makes $f$ smooth everywhere. This leads to poor results at object boundaries.
Energy functions that do not have this problem are called discontinuity-preserving.
A large number of discontinuity-preserving energy functions have been proposed
(see for example [11,14,19]). Geman and Geman's seminal paper [9] gave a
Bayesian interpretation of many energy functions, and proposed a
discontinuity-preserving energy function based on Markov Random Fields (MRF's).

The major difficulty with energy minimization for early vision lies in the 
enormous computational costs. Typically these energy functions have many local 
minima (i.e., they are non-convex). Worse still, the space of possible labelings has 
dimension |P|, which is many thousands. There have been numerous attempts 
to design fast algorithms for energy minimization; we will review this area in 
section 2. However, as a practical matter the computational problem remains 
unresolved. 

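In this paper we address a class of discontinuity-preserving energy functions.
Let the neighborhood system $\mathcal{N}$ denote the set of pairs of adjacent
pixels in $\mathcal{P}$. We consider functions of the form

$$E_P(f) = \sum_{\{p,q\} \in \mathcal{N}} u_{\{p,q\}} \cdot \delta(f_p \neq f_q)
  + \sum_{p \in \mathcal{P}} D_p(f_p), \qquad (1)$$

where

$$\delta(f_p \neq f_q) = \begin{cases} 1 & \text{if } f_p \neq f_q, \\ 0 & \text{otherwise.} \end{cases}$$

We allow $D_p$ to be an arbitrary function, as long as it is non-negative and finite.^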

This energy function is in some sense the simplest energy that preserves
discontinuities. The smoothness term provides a penalty $u_{\{p,q\}} > 0$ for
assigning different labels to two adjacent pixels $\{p,q\}$. This penalty does not
depend on the labels assigned, as long as they are different. Such energy functions
naturally arise from a particular MRF that we call a generalized Potts model; this
derivation is given in [6]. We will therefore refer to the energy function $E_P$
given in equation (1) as the Potts energy.
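As a concrete reading of equation (1), the sketch below evaluates the Potts energy of a labeling on a small one-dimensional image. The helper name `potts_energy`, the four-pixel chain, the unit penalties, and the squared-difference data term are all illustrative assumptions, not part of the paper.

```python
def potts_energy(f, data_cost, edges, u):
    """Potts energy of labeling f (a list indexed by pixel).

    data_cost(p, label) plays the role of D_p, edges lists the neighboring
    pixel pairs {p, q}, and u(p, q) is the penalty for giving them
    different labels.
    """
    data = sum(data_cost(p, f[p]) for p in range(len(f)))
    smooth = sum(u(p, q) for p, q in edges if f[p] != f[q])
    return smooth + data

# Hypothetical 1-D "image" with a single intensity step.
intensities = [0, 0, 3, 3]
edges = [(0, 1), (1, 2), (2, 3)]

def data_cost(p, label):
    # Image-restoration style data term (f_p - i_p)^2.
    return (label - intensities[p]) ** 2

def unit_penalty(p, q):
    return 1

energy_step = potts_energy([0, 0, 3, 3], data_cost, edges, unit_penalty)
energy_flat = potts_energy([0, 0, 0, 0], data_cost, edges, unit_penalty)
```

The labeling that keeps the step pays a single discontinuity penalty, while the constant labeling pays no smoothness cost but a large data cost, which is the trade-off the energy is designed to express.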

In early vision, there are a few energy functions whose global minimum can be 
rapidly computed [6,10,13]. Unfortunately, we have shown in [6] that minimizing 
the Potts energy is NP-hard, so it very likely requires exponential time. In this 
paper we introduce a new approximation algorithm for this energy minimization 
problem, and apply it to several vision problems. The key properties of our 
algorithm are that it produces a local minimum in a certain move space, and 
that the resulting labeling is guaranteed to be within a factor of 2 of the global 
minimum of the Potts energy. 



^ Our results do not require that Dp be finite, but this assumption simplifies the 
presentation considerably. 



A New Algorithm for Energy Minimization with Discontinuities 207 



We begin with a brief survey of energy minimization methods in computer 
vision. In section 3 we give an overview of our approach to energy minimization. 
We define expansion moves and prove that a local minimum of the Potts energy 
with respect to such moves is within a factor of two of the global minimum. 
Section 4 gives the details of a graph cut technique that efficiently computes the 
expansion move producing the largest decrease in the Potts energy. Intuitively, 
this gives the direction of the steepest descent from a current solution. By using 
this technique iteratively we follow the “fastest” way into a local minimum of 
the Potts energy with respect to expansion move space. In section 5 we provide 
some experimental results on the stereo matching problem. 

Note that in our earlier work [6] we presented a similar greedy descent algo- 
rithm for approximate Potts energy minimization based on swap moves. Strictly 
speaking, expansion moves and swap moves are not directly comparable since 
there are swap moves that are not expansion moves and vice versa. However, 
we have both theoretical and experimental evidence that the expansion move 
algorithm is superior. Theoretically, we will show in section 3.3 that a local min- 
imum in terms of expansion moves is within a factor of 2 of the global minimum, 
while no such result is available for the swap move algorithm. Experimentally, 
the results in section 5 suggest that the expansion moves algorithm leads to a 
better and faster optimization of the Potts energy. 

2 Related Work 

Energy minimization is quite popular in computer vision, and a wide variety 
of methods have been used. An exhaustive survey is beyond the scope of this 
paper; however, we will briefly describe the energy minimization methods that 
are most prevalent in vision. 



2.1 Global Energy Minimization 

The problem of finding the global minimum of an arbitrary energy function is 
obviously intractable (it includes the Potts energy minimization problem as a 
special case). As a consequence, any general-purpose energy minimization algo- 
rithm will require exponential time to find the global minimum, unless P=NP. 

Simulated annealing was popularized for vision by [9] , and is the only general- 
purpose global energy minimization method in widespread use. With certain 
annealing schedules, annealing can be guaranteed to find the global minimum. 
Unfortunately, the schedules that lead to this guarantee are extremely slow. In 
practice, annealing is inefficient partly because at each step it changes the value 
of a single pixel. 

Graph cuts can be used to find the global minimum of certain energy functions.
These algorithms permit $D_p$ to be arbitrary. [10] addressed the case of
$|\mathcal{L}| = 2$. This result was generalized by [6,13] to handle label sets of
arbitrary size, when the smoothness energy is of the form
$\sum_{\{p,q\} \in \mathcal{N}} |f_p - f_q|$. This smoothness energy, unfortunately,
leads to oversmoothing at object boundaries. In addition, there must be a natural
isomorphism between the label set $\mathcal{L}$ and the integers
$\{1, 2, \ldots, k\}$. This rules out some significant problems such as motion.

Another alternative is to use methods that have optimality guarantees in 
certain cases. Continuation methods, such as GNC [4], are the best-known ex- 
ample. These methods involve approximating an intractable (non-convex) energy 
function by a sequence of energy functions, beginning with a tractable (convex) 
approximation. At every step in the approximation, a local minimum is found 
using the solution from the previous step as the starting point. There are cir- 
cumstances where these methods are known to compute the optimal solution 
(see [4] for details). Continuation methods can be applied to a large number of 
energy functions, but except for these special cases nothing is known about the 
quality of their output. 



2.2 Local Energy Minimization 

Due to the inefficiency of computing a global minimum, many authors have opted 
for a local minimum. One problem with this is that it is difficult to determine the 
cause of an algorithm’s failures. When an algorithm gives unsatisfactory results, 
it may be due either to a poor choice of energy function, or to the fact the answer 
is far from the global minimum. There is no obvious way to tell which of these is 
the problem.^ Another issue is that local minimization techniques are naturally 
sensitive to the initial estimate. 

There are several ways in which a local minimum can be computed. By phras- 
ing the energy minimization problem in continuous terms, variational methods 
can be applied. These methods were popularized by Horn [12]. Variational tech- 
niques use the Euler equations, which are guaranteed to hold at a local minimum 
(although they may also hold elsewhere). A number of methods have been pro- 
posed to speed up the convergence of the resulting numerical problems, includ- 
ing (for example) multigrid techniques [18]. To apply these algorithms to actual 
imagery, of course, requires discretization. An alternative is to use discrete re- 
laxation methods; this has been done by many authors, including [7,16,17]. 

It is important to note that a local minimum is defined relative to a set
of allowed moves. Most existing minimization algorithms find a local minimum
relative to "small" moves, which typically are defined in terms of the
$L_2$ distance. To be precise, they attempt to compute a labeling $f$ such that
$f = \arg\min_{\|f' - f\| < \epsilon} E_P(f')$, for some small $\epsilon$.

In [6] we described an algorithm for approximate minimization of the Potts
energy based on swap moves. For a fixed pair of labels $\alpha, \beta$, this move
swaps the labels between a subset of pixels labeled $\alpha$ and another subset
labeled $\beta$. The algorithm in [6] is based on a graph cut technique that
efficiently computes the best $\alpha,\beta$-swap move from a current solution. By
iterating over all distinct pairs $\alpha, \beta$ this technique enables the
steepest descent search of the local minimum of the Potts energy with respect to
swap moves. The properties of such a local minimum are based on the strength of
swap moves.

^ In the special cases where the global minimum can be rapidly computed, it is possible
to separate these issues. For example, [10] points out that the global minimum of an
Ising energy function is not necessarily the desired solution for image restoration.
[5,10] analyze the performance of simulated annealing in cases with a known global
minimum.

In this paper we describe a new algorithm based on expansion moves. The 
structure of the algorithm is similar to [6] . It is still the greediest descent into a 
local minimum with respect to a certain move space. However, there are several 
important differences. Most importantly, the new algorithm produces a local 
minimum that is guaranteed to be within a factor of 2 from the global minimum. 
Such a bound is not available for the swap move algorithm in [6]. Moreover, the 
steepest descent in the new move space requires iterating over distinct labels, not 
pairs of labels. Altogether this suggests that the new algorithm can potentially
produce faster and better solutions. The data we present in section 5 supports 
this conclusion. 

3 The Expansion Move Algorithm 

Here we describe the algorithm for approximate minimization of the Potts energy 
Ep based on expansion moves. In this section, we discuss the expansion moves, 
which are best described in terms of partitions. We sketch the algorithm and list 
its basic properties. Then we introduce the notion of a graph cut, which is the 
basis for our method. 



3.1 Partitions and Move Spaces 

Any labeling $f$ can be uniquely represented by a partition of image pixels

$$\mathbf{P} = \{\mathcal{P}_l \mid l \in \mathcal{L}\}$$

where $\mathcal{P}_l = \{p \in \mathcal{P} \mid f_p = l\}$ is a subset of pixels
assigned label $l$. Since there is an obvious one to one correspondence between
labelings $f$ and partitions $\mathbf{P}$, we can use these notions interchangeably.

Given a label $\alpha$, a move from a partition $\mathbf{P}$ (labeling $f$) to a new
partition $\mathbf{P}'$ (labeling $f'$) is called an $\alpha$-expansion if
$\mathcal{P}_\alpha \subset \mathcal{P}'_\alpha$ and
$\mathcal{P}'_l \subset \mathcal{P}_l$ for any label $l \neq \alpha$. In other words,
an $\alpha$-expansion move allows any set of image pixels to change their labels to
$\alpha$.

Note that a move which gives an arbitrary label $\alpha$ to a single pixel is an
$\alpha$-expansion. As a consequence, the standard move space used in annealing is a
special case of our move space.
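In terms of labelings rather than partitions, the definition says that every pixel either keeps its old label or switches to $\alpha$. A minimal sketch of that test follows; the function name and the list-of-labels representation are assumptions made for illustration.

```python
def is_alpha_expansion(f, f_new, alpha):
    """True if f_new is within one alpha-expansion of f.

    Each pixel must either keep its label or take the label alpha; in
    particular, pixels already labeled alpha must stay alpha.
    """
    if len(f) != len(f_new):
        return False
    return all(new == old or new == alpha for old, new in zip(f, f_new))

f = [0, 1, 2, 1]
grown = [0, 0, 2, 0]          # two pixels switch from label 1 to label 0
single_pixel = [0, 1, 0, 1]   # the annealing-style move: one pixel changes
shrunk = [1, 1, 2, 1]         # a pixel labeled 0 loses its label
```

Here `grown` and `single_pixel` are 0-expansions of `f`, while `shrunk` is not, because the region labeled 0 may only grow under a 0-expansion.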



3.2 Algorithm and Properties 

The structure of the expansion move algorithm is shown in figure 1. We will call
a single execution of steps 3.1-3.2 an iteration, and an execution of steps 2-4
a cycle. In each cycle, the algorithm performs an iteration for every label in a
certain order that can be fixed or random. A cycle is successful if a strictly
better labeling is found at any iteration. The algorithm stops after the first
unsuccessful cycle since no further improvement is possible.



1. Start with an arbitrary labeling $f$
2. Set success := 0
3. For each label $\alpha \in \mathcal{L}$
   3.1. Find $\hat{f} = \arg\min E_P(f')$ among $f'$ within one $\alpha$-expansion of $f$
   3.2. If $E_P(\hat{f}) < E_P(f)$, set $f := \hat{f}$ and success := 1
4. If success = 1 goto 2
5. Return $f$



Fig. 1. The expansion move algorithm. 
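The control flow of figure 1 can be sketched as follows. This is not the paper's implementation: step 3.1 is done here by exhaustive enumeration of all $\alpha$-expansions, which is only feasible for a handful of pixels and stands in for the graph cut computation. The five-pixel chain, the cost table, and all function names are illustrative assumptions.

```python
from itertools import product

def potts_energy(f, data, edges, u):
    """Potts energy with data[p][label] as D_p and a uniform penalty u."""
    return (sum(data[p][f[p]] for p in range(len(f)))
            + sum(u for p, q in edges if f[p] != f[q]))

def best_expansion(f, alpha, data, edges, u):
    """Step 3.1 by brute force: try every subset of pixels switching to alpha."""
    best, best_e = list(f), potts_energy(f, data, edges, u)
    for mask in product([False, True], repeat=len(f)):
        g = [alpha if m else fp for m, fp in zip(mask, f)]
        e = potts_energy(g, data, edges, u)
        if e < best_e:
            best, best_e = g, e
    return best

def expansion_move(f, labels, data, edges, u):
    f = list(f)
    success = True
    while success:                      # each pass of this loop is one cycle
        success = False
        for alpha in labels:            # each pass of this loop is one iteration
            g = best_expansion(f, alpha, data, edges, u)
            if potts_energy(g, data, edges, u) < potts_energy(f, data, edges, u):
                f, success = g, True
    return f

def brute_force_optimum(n, labels, data, edges, u):
    """Global minimum by enumeration, used only to check the factor-2 bound."""
    return min((list(g) for g in product(labels, repeat=n)),
               key=lambda g: potts_energy(g, data, edges, u))

# Hypothetical 5-pixel chain with labels 0..2 and squared-difference data cost.
intensities = [0, 0, 1, 2, 2]
labels = [0, 1, 2]
data = [[(l - i) ** 2 for l in labels] for i in intensities]
edges = [(p, p + 1) for p in range(4)]
u = 1

f_hat = expansion_move([0] * 5, labels, data, edges, u)
f_opt = brute_force_optimum(5, labels, data, edges, u)
```

On such a tiny instance the returned labeling can be compared directly against the enumerated optimum, which gives a quick sanity check of the factor-of-2 guarantee.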



The algorithm has a number of important properties.

- Obviously, a cycle takes $|\mathcal{L}|$ iterations. Note that a cycle in the
  swap move algorithm [6] takes $|\mathcal{L}|^2$ iterations.

- The algorithm is guaranteed to terminate in a finite number of cycles,
  although there is no bound beyond the trivial one of
  $|\mathcal{L}|^{|\mathcal{P}|}$. Nevertheless, in the applications we have
  considered the algorithm stops after a few cycles. Moreover, most of the
  improvements occur during the first cycle, as we will show in section 5.

- Once the algorithm has terminated, the energy of the resulting labeling is a
  local minimum with respect to an expansion move.



3.3 Optimality Guarantees 

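We now show that if $\hat{f}$ is a local minimum in terms of expansion moves, then
$E_P(\hat{f}) \le 2 \cdot E_P(f^*)$, where $f^*$ is the optimal solution minimizing
the Potts energy $E_P$. Let
$\mathbf{P}^* = \{\mathcal{P}^*_\alpha \mid \alpha \in \mathcal{L}\}$ be a partition
corresponding to $f^*$, so that

$$\mathcal{P}^*_\alpha = \{p \in \mathcal{P} \mid f^*_p = \alpha\}$$

is a set of pixels assigned to $\alpha$ in the optimal solution. We can produce a
labeling $f^\alpha$ within one $\alpha$-expansion move from $\hat{f}$ as follows:

$$f^\alpha_p = \begin{cases} \alpha & \text{if } p \in \mathcal{P}^*_\alpha, \\ \hat{f}_p & \text{otherwise.} \end{cases}$$

The key observation is that since $\hat{f}$ is a local minimum in the expansion
move space then for any $\alpha \in \mathcal{L}$

$$E_P(\hat{f}) \le E_P(f^\alpha). \qquad (2)$$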
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For a given label a G £ we can split the Potts energy of any labeling / into 
three terms Ep{f) = + E^/{f) + E"^{f) where 

Knif) = E 

V pU 

p'qD -pf 



EW) = E 



^p'qOQ J\f 
pD Pf 'qD 



EW) 



'^fp,gO’ ^(/p 7 ^ fq) 



Ip'qOQ JV 
p^qU -Pf 



pU 



correspond to the parts of the Potts energy Ep{f) concentrated at the pixels 
inside 7^^, at the boundary of 7^^, and at the pixels outside of 7^^, correspond- 
ingly. 

Since = /“ for any p ^ then = E^^(f^). Thus, (2) implies 

that for any a £ C 



EW°) + EUf) < EfW) + EUn- 

Since fp=fp=c^ for any p e then = Ef^{f^). Moreover, 

Eun < E = Ew°)- 

Ip'qOQ N 
pD Pf ^qD Pf 

Therefore, (3) implies that for any a ^ C 

Efnif) + Pm(/°) < EWl + EWl- 

Summing up inequality (4) over all labels a G £ we obtain 

Ep{f )+ E < Ep{n+ E 

lp,gd] B fp,g<g B 



( 3 ) 



( 4 ) 



( 5 ) 



where B = {{p^q} G A/' | 7^ /g} is a set of all pairs of neighboring pixels 

disconnected in the optimal solution Note that the summations on both 
sides of (5) show up because each pair of pixels in B is encountered twice when 
summing up the terms in (4) over a ^ C. Finally, since ^ 

then (5) implies that Ep{f^) < 2Ep{f^). 
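
The bound can be checked numerically on a small instance. The following is a minimal sketch, not the authors' implementation: it assumes a 1D chain of pixels, a single Potts weight `u` shared by all neighboring pairs, and nonnegative data costs, and it finds the best expansion move by exhaustive search rather than by graph cuts. All names and the toy data are hypothetical.

```python
from itertools import product

def potts_energy(f, D, u):
    """Data costs plus a penalty u for every neighboring pair with different labels."""
    data = sum(D[p][label] for p, label in enumerate(f))
    smooth = sum(u for a, b in zip(f, f[1:]) if a != b)
    return data + smooth

def best_expansion(f, alpha, D, u):
    """Exhaustively search all labelings within one alpha-expansion of f."""
    options = [(label,) if label == alpha else (label, alpha) for label in f]
    return min(product(*options), key=lambda g: potts_energy(g, D, u))

def expansion_local_minimum(f, num_labels, D, u):
    """Apply improving expansion moves until no label yields a lower energy."""
    f = tuple(f)
    improved = True
    while improved:
        improved = False
        for alpha in range(num_labels):
            g = best_expansion(f, alpha, D, u)
            if potts_energy(g, D, u) < potts_energy(f, D, u):
                f, improved = g, True
    return f

# Hypothetical toy instance: 5 pixels on a chain, 3 labels, nonnegative data costs.
D = [[0, 3, 4], [2, 0, 3], [3, 1, 0], [0, 2, 5], [4, 0, 1]]
u = 2
f_hat = expansion_local_minimum([0] * 5, 3, D, u)
f_star = min(product(range(3), repeat=5), key=lambda g: potts_energy(g, D, u))
print(potts_energy(f_hat, D, u), potts_energy(f_star, D, u))
```

The exhaustive search is exponential in the number of pixels and is only meant to illustrate the inequality; it says nothing about running time on real images.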



3.4 Graph Cuts 

The key part of the algorithm is step 3.1, where graph cuts are used to efficiently
find $\hat{f}$. Let $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a weighted graph with two distinguished vertices called
the terminals. A cut $\mathcal{C} \subset \mathcal{E}$ is a set of edges such that the terminals are separated
in the induced graph $\mathcal{G}(\mathcal{C}) = (\mathcal{V}, \mathcal{E} - \mathcal{C})$. In addition, no proper subset of $\mathcal{C}$
separates the terminals in $\mathcal{G}(\mathcal{C})$. The cost of the cut $\mathcal{C}$ is denoted by $|\mathcal{C}|$ and
equals the sum of its edge weights.

The minimum cut problem is to find the cut with smallest cost. This problem
can be solved very efficiently by computing the maximum flow between the
terminals, according to a theorem due to Ford and Fulkerson [8]. There are a
large number of fast algorithms for this problem (see [1], for example). The worst
case complexity is low-order polynomial; however, in practice the running time
is nearly linear.

Step 3.1 uses a single minimum cut on a graph whose size is $O(|\mathcal{P}|)$. The
graph is dynamically updated after each iteration. The next section describes
the details of our graph cut technique that allows efficient implementation of
step 3.1.
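
To make the definition of a cut and its cost concrete, here is a small sketch that finds a minimum cut by brute force on a toy undirected graph. It enumerates every vertex set that contains one terminal but not the other and sums the weights of the edges crossing the boundary. The graph, the terminal names `s` and `t`, and the function name are assumptions made for illustration; a practical implementation would use a maximum flow algorithm instead.

```python
from itertools import combinations

def min_cut_brute_force(edges, s, t):
    """Try every vertex set containing s but not t; return the cheapest cut found."""
    others = sorted({v for e in edges for v in e} - {s, t})
    best = None
    for k in range(len(others) + 1):
        for chosen in combinations(others, k):
            side = {s, *chosen}
            # An edge is cut when its endpoints fall on different sides.
            cut = {e for e in edges if (e[0] in side) != (e[1] in side)}
            cost = sum(edges[e] for e in cut)
            if best is None or cost < best[0]:
                best = (cost, cut)
    return best

# Hypothetical weighted graph with terminals "s" and "t".
edges = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
cost, cut = min_cut_brute_force(edges, "s", "t")
print(cost, sorted(cut))
```

With strictly positive weights the cheapest boundary found this way contains no superfluous edge, so it also satisfies the requirement that no proper subset separates the terminals.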

4 Finding the Optimal Expansion Move 

Given an input labeling $f$ (partitioning $\mathbf{P}$) and a label $\alpha$, we wish to find a
labeling $\hat{f}$ that minimizes $E_P$ over all labelings within one $\alpha$-expansion of $f$.
This is the critical step in the algorithm given at the bottom of figure 1. Our
technique is based on computing a labeling corresponding to a minimum cut on
a graph $\mathcal{G}_\alpha = (\mathcal{V}_\alpha, \mathcal{E}_\alpha)$. The structure of this graph is determined by the current
partitioning $\mathbf{P}$ and by the label $\alpha$. The graph dynamically changes after each
iteration.

This section is organized as follows. First we describe the construction of
$\mathcal{G}_\alpha$ for a given $f$ (or $\mathbf{P}$) and $\alpha$. We show that cuts $\mathcal{C}$ on $\mathcal{G}_\alpha$ correspond in a
natural way to labelings $f^{\mathcal{C}}$ which are within one $\alpha$-expansion move of $f$. Then,
based on a number of simple properties, we define a class of elementary cuts.
Theorem 1 shows that elementary cuts are in one to one correspondence with
the set of labelings that are within one $\alpha$-expansion of $f$, and also that the cost
of an elementary cut is $|\mathcal{C}| = E_P(f^{\mathcal{C}})$. A corollary from this theorem states our
main result that the desired labeling $\hat{f}$ equals $f^{\mathcal{C}}$ where $\mathcal{C}$ is a minimum cut on
$\mathcal{G}_\alpha$.

The structure of the graph is illustrated in Figure 2. For legibility, this figure
shows the case of a 1D image. In fact, the structure of $\mathcal{G}_\alpha$ will be the same for
any image. The set of vertices includes the two terminals $\alpha$ and $\bar{\alpha}$, as well as all
image pixels $p \in \mathcal{P}$. In addition, for each pair of neighboring pixels $\{p, q\} \in \mathcal{N}$
separated in the current partition (i.e. $f_p \ne f_q$) we create an auxiliary vertex
$a_{\{p,q\}}$. Auxiliary nodes are introduced at the boundaries between partition sets
$\mathcal{P}_l$ for $l \in \mathcal{L}$. Thus, the set of vertices is

$$\mathcal{V}_\alpha = \Big\{ \alpha,\; \bar{\alpha},\; \mathcal{P},\; \bigcup_{\{p,q\} \in \mathcal{N},\; f_p \ne f_q} a_{\{p,q\}} \Big\}.$$


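Each pixel $p \in \mathcal{P}$ is connected to the terminals $\alpha$ and $\bar{\alpha}$ by edges $t_p^{\alpha}$ and
$t_p^{\bar{\alpha}}$, correspondingly. For brevity, we will refer to these edges as t-links (terminal
links). Each pair of neighboring pixels $\{p, q\} \in \mathcal{N}$ which is not separated by the
partition $\mathbf{P}$ (i.e. $f_p = f_q$) is connected by an edge $e_{\{p,q\}}$ which we will call an
n-link (neighborhood link). For each pair of neighboring pixels $\{p, q\} \in \mathcal{N}$ such
that $f_p \ne f_q$ we create a triplet of edges

$$\mathcal{E}_{\{p,q\}} = \big\{ e_{\{p,a\}},\; e_{\{a,q\}},\; t_a^{\bar{\alpha}} \big\}$$

where $a = a_{\{p,q\}}$ is the corresponding auxiliary node. The n-links $e_{\{p,a\}}$ and
$e_{\{a,q\}}$ connect pixels $p$ and $q$ to $a_{\{p,q\}}$ and the t-link $t_a^{\bar{\alpha}}$ connects the auxiliary
node $a_{\{p,q\}}$ to the terminal $\bar{\alpha}$. Finally, we can write the set of all edges as

$$\mathcal{E}_\alpha = \Big\{ \bigcup_{p \in \mathcal{P}} \{t_p^{\alpha}, t_p^{\bar{\alpha}}\},\; \bigcup_{\{p,q\} \in \mathcal{N},\; f_p \ne f_q} \mathcal{E}_{\{p,q\}},\; \bigcup_{\{p,q\} \in \mathcal{N},\; f_p = f_q} e_{\{p,q\}} \Big\}.$$

Fig. 2. An example of the graph $\mathcal{G}_\alpha$ for a 1D image. The set of pixels in the
image is $\mathcal{P} = \{p, q, r, s\}$ and the current partition is $\mathbf{P} = \{\mathcal{P}_1, \mathcal{P}_2, \mathcal{P}_\alpha\}$ where
$\mathcal{P}_1 = \{p\}$, $\mathcal{P}_2 = \{q, r\}$, and $\mathcal{P}_\alpha = \{s\}$. There are two auxiliary nodes $a = a_{\{p,q\}}$,
$b = a_{\{r,s\}}$ introduced between neighboring pixels separated in the current
partition. Auxiliary nodes are added at the boundary of sets $\mathcal{P}_l$.
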

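The weights assigned to the edges are shown in the table below.

edge                                                      weight          for
$t_p^{\alpha}$                                            $D_p(\alpha)$   $p \in \mathcal{P}_\alpha$
$t_p^{\bar{\alpha}}$                                      $\infty$        $p \in \mathcal{P}_\alpha$
$t_p^{\alpha}$                                            $D_p(\alpha)$   $p \notin \mathcal{P}_\alpha$
$t_p^{\bar{\alpha}}$                                      $D_p(f_p)$      $p \notin \mathcal{P}_\alpha$
$e_{\{p,a\}} = e_{\{a,q\}} = t_a^{\bar{\alpha}}$          $u_{\{p,q\}}$   $\{p, q\} \in \mathcal{N},\ f_p \ne f_q$
$e_{\{p,q\}}$                                             $u_{\{p,q\}}$   $\{p, q\} \in \mathcal{N},\ f_p = f_q$
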



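Any cut $\mathcal{C}$ on the graph $\mathcal{G}_\alpha$ must sever (include) exactly one t-link for any
pixel $p \in \mathcal{P}$: if neither t-link were in $\mathcal{C}$, there would be a path between the
terminals; while if both t-links were cut, then a proper subset of $\mathcal{C}$ would be a
cut. Thus, any cut includes either $t_p^{\alpha}$ or $t_p^{\bar{\alpha}}$ for each pixel $p \in \mathcal{P}$. This defines a
natural labeling $f^{\mathcal{C}}$ corresponding to a cut $\mathcal{C}$ on $\mathcal{G}_\alpha$. Formally,

$$f^{\mathcal{C}}_p = \begin{cases} \alpha & \text{if } t_p^{\alpha} \in \mathcal{C} \\ f_p & \text{if } t_p^{\bar{\alpha}} \in \mathcal{C} \end{cases} \qquad \forall p \in \mathcal{P}. \qquad (6)$$

In other words, a pixel $p$ is assigned label $\alpha$ if the cut $\mathcal{C}$ severs t-link $t_p^{\alpha}$, while
$p$ is assigned its old label $f_p$ if $\mathcal{C}$ severs $t_p^{\bar{\alpha}}$. The terminal $\alpha$ stands for the new
label and the terminal $\bar{\alpha}$ stands for the old labels assigned to pixels in the initial
labeling $f$.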

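Lemma 1. A cut $\mathcal{C}$ on $\mathcal{G}_\alpha$ corresponds to a labeling $f^{\mathcal{C}}$ which is one $\alpha$-expansion
away from the original labeling $f$.

Proof. A cut $\mathcal{C}$ cannot sever t-links $t_p^{\bar{\alpha}}$ for any pixel $p \in \mathcal{P}_\alpha$ due to the infinite
cost. Thus, $f^{\mathcal{C}}_p = \alpha$ for any $p \in \mathcal{P}_\alpha$. For any pixel $p \notin \mathcal{P}_\alpha$ the value of $f^{\mathcal{C}}_p$ can be
either $\alpha$ or $f_p$. □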



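It is easy to show that a cut $\mathcal{C}$ severs an n-link $e_{\{p,q\}}$ between neighboring
pixels $\{p, q\} \in \mathcal{N}$ such that $f_p = f_q$ if and only if $\mathcal{C}$ leaves the pixels $p$ and $q$
connected to different terminals. Formally, for any cut $\mathcal{C}$

Property 1. If $t_p^{\alpha}, t_q^{\alpha} \in \mathcal{C}$ then $e_{\{p,q\}} \notin \mathcal{C}$.

Property 2. If $t_p^{\bar{\alpha}}, t_q^{\bar{\alpha}} \in \mathcal{C}$ then $e_{\{p,q\}} \notin \mathcal{C}$.

Property 3.1 If $t_p^{\bar{\alpha}}, t_q^{\alpha} \in \mathcal{C}$ then $e_{\{p,q\}} \in \mathcal{C}$.

Property 3.2 If $t_p^{\alpha}, t_q^{\bar{\alpha}} \in \mathcal{C}$ then $e_{\{p,q\}} \in \mathcal{C}$.

The first two properties follow from the requirement that no proper subset of $\mathcal{C}$
should separate the terminals. Properties 3.1 and 3.2 also use the fact that a cut
has to separate the terminals.

These properties are illustrated in Figure 3. The following lemma is a
consequence of properties 1-3 above and equation (6).

Lemma 2. If $\{p, q\} \in \mathcal{N}$ and $f_p = f_q$ then any cut $\mathcal{C}$ on $\mathcal{G}_\alpha$ satisfies

$$|\mathcal{C} \cap e_{\{p,q\}}| = u_{\{p,q\}} \cdot \delta(f^{\mathcal{C}}_p \ne f^{\mathcal{C}}_q).$$

Fig. 3. Properties of a cut $\mathcal{C}$ on $\mathcal{G}_\alpha$ for two pixels $\{p, q\} \in \mathcal{N}$ such that $f_p = f_q$.
Dotted lines show the edges cut by $\mathcal{C}$ and solid lines show the edges remaining
in the induced graph $\mathcal{G}_\alpha(\mathcal{C}) = (\mathcal{V}_\alpha, \mathcal{E}_\alpha - \mathcal{C})$.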



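Consider now the set of edges $\mathcal{E}_{\{p,q\}}$ corresponding to a pair of neighboring
pixels $\{p, q\} \in \mathcal{N}$ such that $f_p \ne f_q$. In this case, there are several different
ways to cut these edges even when the pair of severed t-links at $p$ and $q$ is fixed.
However, a minimum cut $\mathcal{C}$ on $\mathcal{G}_\alpha$ is guaranteed to sever the edges in $\mathcal{E}_{\{p,q\}}$
depending on what t-links are cut at the pixels $p$ and $q$.

The rule for this case is described in properties 4-6 below. Assume that
$a = a_{\{p,q\}}$ is an auxiliary node between the corresponding pair of neighboring
pixels. Then a minimum cut $\mathcal{C}$ on $\mathcal{G}_\alpha$ satisfies the following properties.

Property 4. If $t_p^{\bar{\alpha}}, t_q^{\bar{\alpha}} \in \mathcal{C}$ then $\mathcal{C} \cap \mathcal{E}_{\{p,q\}} = t_a^{\bar{\alpha}}$.

Property 5. If $t_p^{\alpha}, t_q^{\alpha} \in \mathcal{C}$ then $\mathcal{C} \cap \mathcal{E}_{\{p,q\}} = \emptyset$.

Property 6.1 If $t_p^{\bar{\alpha}}, t_q^{\alpha} \in \mathcal{C}$ then $\mathcal{C} \cap \mathcal{E}_{\{p,q\}} = e_{\{p,a\}}$.

Property 6.2 If $t_p^{\alpha}, t_q^{\bar{\alpha}} \in \mathcal{C}$ then $\mathcal{C} \cap \mathcal{E}_{\{p,q\}} = e_{\{a,q\}}$.

Property 5 results from the fact that no proper subset of $\mathcal{C}$ is a cut. The others
follow from the minimality of $|\mathcal{C}|$ and the fact that $e_{\{p,a\}}$, $e_{\{a,q\}}$ and $t_a^{\bar{\alpha}}$ have
identical weights. These properties are illustrated in figure 4.

Lemma 3. If $\{p, q\} \in \mathcal{N}$ and $f_p \ne f_q$ then the minimum cut $\mathcal{C}$ on $\mathcal{G}_\alpha$ satisfies

$$|\mathcal{C} \cap \mathcal{E}_{\{p,q\}}| = u_{\{p,q\}} \cdot \delta(f^{\mathcal{C}}_p \ne f^{\mathcal{C}}_q).$$

Proof. The equation follows from properties 4-6 above and equation (6). Note
that $f_p \ne f_q$ implies that $\alpha$ is the only common label that a cut on $\mathcal{G}_\alpha$ can
assign to $p$ and $q$ using our convention in (6). That is, $f^{\mathcal{C}}_p = f^{\mathcal{C}}_q$ if and only if
both $f^{\mathcal{C}}_p = \alpha$ and $f^{\mathcal{C}}_q = \alpha$. □

Fig. 4. Properties of a minimum cut $\mathcal{C}$ on $\mathcal{G}_\alpha$ for two pixels $\{p, q\} \in \mathcal{N}$ such that
$f_p \ne f_q$ (panels: Property 4, Property 5, Property 6.1 (6.2)). Dotted lines show
the edges cut by $\mathcal{C}$ and solid lines show the edges in the induced graph
$\mathcal{G}_\alpha(\mathcal{C}) = (\mathcal{V}_\alpha, \mathcal{E}_\alpha - \mathcal{C})$.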



Note that the penalty $u_{\{p,q\}}$ is imposed whenever $f^{\mathcal{C}}_p \ne f^{\mathcal{C}}_q$. This is exactly
what the auxiliary pixel construction was designed for. We had to develop a
special trick for the case when the original labels for $p$ and $q$ do not agree
($f_p \ne f_q$) in order to get the same effect that lemma 2 establishes for the simpler
situation when $f_p = f_q$.

Properties 1-3 hold for any cut, and properties 4-6 hold for a minimum
cut. However, there can be other cuts besides the minimum cut that satisfy all
six properties. We will define an elementary cut on $\mathcal{G}_\alpha$ to be a cut that satisfies
properties 1-6.

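Theorem 1. Let the graph $\mathcal{G}_\alpha$ be constructed as above for a given $f$ and $\alpha$.
Then there is a one to one correspondence between the set of all elementary cuts
on $\mathcal{G}_\alpha$ and the set of all labelings within one $\alpha$-expansion of $f$. Moreover, for
any elementary cut $\mathcal{C}$ we have $|\mathcal{C}| = E_P(f^{\mathcal{C}})$.

Proof. We first show that an elementary cut $\mathcal{C}$ is uniquely determined by the
corresponding labeling $f^{\mathcal{C}}$. The label $f^{\mathcal{C}}_p$ at the pixel $p$ determines which of
the t-links to $p$ is in $\mathcal{C}$. Properties 1-3 show which n-links $e_{\{p,q\}}$ between pairs
of neighboring pixels $\{p, q\}$ such that $f_p = f_q$ should be severed. Similarly,
properties 4-6 determine which of the links in $\mathcal{E}_{\{p,q\}}$ corresponding to $\{p, q\} \in \mathcal{N}$
such that $f_p \ne f_q$ should be cut.

We now compute the cost of an elementary cut $\mathcal{C}$, which is

$$|\mathcal{C}| = \sum_{p \in \mathcal{P}} |\mathcal{C} \cap \{t_p^{\alpha}, t_p^{\bar{\alpha}}\}| + \sum_{\{p,q\} \in \mathcal{N},\; f_p = f_q} |\mathcal{C} \cap e_{\{p,q\}}| + \sum_{\{p,q\} \in \mathcal{N},\; f_p \ne f_q} |\mathcal{C} \cap \mathcal{E}_{\{p,q\}}|. \qquad (7)$$

It is easy to show that for any pixel $p \in \mathcal{P}$ we have $|\mathcal{C} \cap \{t_p^{\alpha}, t_p^{\bar{\alpha}}\}| = D_p(f^{\mathcal{C}}_p)$.
Lemmas 2 and 3 hold for elementary cuts, since they were based on properties
1-6. These two lemmas give us the second and the third terms in (7). Thus, the
total cost of an elementary cut $\mathcal{C}$ is

$$|\mathcal{C}| = \sum_{p \in \mathcal{P}} D_p(f^{\mathcal{C}}_p) + \sum_{\{p,q\} \in \mathcal{N}} u_{\{p,q\}} \cdot \delta(f^{\mathcal{C}}_p \ne f^{\mathcal{C}}_q) = E_P(f^{\mathcal{C}}).$$

Therefore, $|\mathcal{C}| = E_P(f^{\mathcal{C}})$. □

Our main result is a simple consequence of this theorem, since the minimum
cut is an elementary cut.

Corollary 1. The optimal $\alpha$-expansion move from $f$ is $\hat{f} = f^{\mathcal{C}}$ where $\mathcal{C}$ is the
minimum cut on $\mathcal{G}_\alpha$.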
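
The construction can be sketched end to end on a 1D image. The code below is an illustrative sketch under several assumptions rather than the authors' implementation: pixels form a chain, all neighboring pairs share one Potts weight `u`, the infinite t-link weight is replaced by a large finite number, and the minimum cut is found with a plain Edmonds-Karp maximum flow. The node names `"alpha"`, `"alpha_bar"`, `("aux", p, q)` and all function names are hypothetical.

```python
from collections import deque
from itertools import product

def min_cut_source_side(cap, s, t):
    """Edmonds-Karp max flow; returns (cut cost, vertices still reachable from s)."""
    res = {}
    for (a, b), c in cap.items():          # each undirected edge becomes two arcs
        res.setdefault(a, {})
        res.setdefault(b, {})
        res[a][b] = res[a].get(b, 0) + c
        res[b][a] = res[b].get(a, 0) + c
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue:                       # breadth-first search in the residual graph
            a = queue.popleft()
            for b, c in res[a].items():
                if c > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if t not in parent:
            return flow, set(parent)
        path, b = [], t
        while parent[b] is not None:
            path.append((parent[b], b))
            b = parent[b]
        push = min(res[a][b] for a, b in path)
        for a, b in path:
            res[a][b] -= push
            res[b][a] += push
        flow += push

def optimal_expansion(f, alpha, D, u):
    """Best alpha-expansion of labeling f on a chain, via a minimum cut."""
    n = len(f)
    big = sum(map(sum, D)) + u * n + 1     # stands in for the infinite weight
    cap = {}
    for p in range(n):
        cap[("alpha", p)] = D[p][alpha]                               # t-link to alpha
        cap[(p, "alpha_bar")] = big if f[p] == alpha else D[p][f[p]]  # t-link to alpha_bar
    for p in range(n - 1):
        q = p + 1
        if f[p] == f[q]:
            cap[(p, q)] = u                # ordinary n-link
        else:
            a = ("aux", p, q)              # auxiliary node between separated pixels
            cap[(p, a)] = u
            cap[(a, q)] = u
            cap[(a, "alpha_bar")] = u
    cost, source_side = min_cut_source_side(cap, "alpha", "alpha_bar")
    # Pixels left connected to alpha had their alpha_bar link cut and keep the old label.
    return cost, [f[p] if p in source_side else alpha for p in range(n)]

def potts_energy(f, D, u):
    return sum(D[p][l] for p, l in enumerate(f)) + sum(u for a, b in zip(f, f[1:]) if a != b)

# Hypothetical toy instance: 4 pixels, 3 labels, expansion of label 2.
D = [[0, 3, 4], [2, 0, 3], [3, 1, 0], [0, 2, 5]]
f = [0, 1, 1, 2]
cost, new_f = optimal_expansion(f, 2, D, 2)
brute = min(potts_energy(g, D, 2)
            for g in product(*[(l,) if l == 2 else (l, 2) for l in f]))
print(cost, new_f, brute)
```

On this instance the cut cost, the energy of the labeling read off the cut, and the exhaustive minimum over all expansions agree, as Theorem 1 and Corollary 1 predict.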

5 Experimental Results 

In this section we apply our method to the stereo matching problem. We compare
our method with simulated annealing, using real image pairs, including one with
dense ground truth. For $D_p$ we use the method of [3] to reduce the effects of
image sampling. We select $u_{\{p,q\}}$ using the information present in a single image,
as described in [6].

We experimented with several variants of simulated annealing, including both 
the standard (Metropolis) sampler and the Gibbs sampler. Our comparative data 
uses the annealing variant and the choice of cooling schedule that best minimized 
the energy. Simulated annealing is quite sensitive to the starting point, so we 
initialized it using the results of normalized correlation. Our methods give very 
similar answers regardless of the starting point, but we used the same starting 
point as annealing to make the comparison fair. All running times are given in 
seconds, on a 200 MHz Pentium Pro. 

Figure 5(a) shows the left image of a real stereo pair where the ground truth 
is known at each pixel. We obtained this image pair from the University of 
Tsukuba Multiview Image Database. The ground truth is shown in figure 5(b). 
Our results are shown below, both for the expansion move algorithm presented 
in this paper and the swap move algorithm introduced in [6]. For comparison, 
we also show the results from simulated annealing, as well as from normalized 
correlation (using the window size that minimizes the number of errors). Figure 7 
shows the performance of algorithms as a function of time, both in terms of 
energy and in terms of accuracy with respect to the ground truth. 

Figure 6 shows the performance of the expansion move algorithm on the CMU
meter image, along with the results of simulated annealing. The performance in
terms of energy is similar to the results shown in figure 7(a).

(d) Swap move method (e) Expansion move method (f) Normalized correlation
Fig. 5. Performance on real imagery with ground truth

(a) Left image (b) Expansion move algorithm (c) Simulated annealing
Fig. 6. CMU meter imagery results

Fig. 7. Performance comparison with simulated annealing, on the imagery shown
in figure 5(a). Comparison is done in terms of energy (top) and accuracy with
respect to the ground truth (bottom). Each data point for our methods
corresponds to a cycle.

Convergence of a Hill Climbing Genetic
Algorithm for Graph Matching

Abstract. This paper presents a convergence analysis for the problem
of consistent labelling using genetic search. The work builds on a recent 
empirical study of graph matching where we showed that a Bayesian 
consistency measure could be efficiently optimised using a hybrid ge- 
netic search procedure which incorporated a hill-climbing step. In the 
present study we return to the algorithm and provide some theoretical 
justification for its observed convergence behaviour. The main conclu- 
sion of this study is that the hill-climbing step significantly accelerates 
convergence, and that the convergence rate is polynomial in the size of 
the node-set of the graphs being matched. 



1 Introduction 

Configurational optimisation problems permeate all fields of machine intelli- 
gence. Broadly speaking, they are concerned with assigning symbolic or dis- 
cretely defined variables to sites organised on a regular or irregular network 
in such a way as to satisfy certain hard constraints governing the structure of 
the final solution. The problem has been studied for over three decades. Con- 
crete examples include the travelling salesman [7] and N-queens problems [8] 
together with a variety of network labelling [9,10] and graph-matching [16] or 
graph colouring problems. The search for consistent solutions has been addressed 
using a number of computational techniques. Early examples from the artificial 
intelligence literature include Mackworth’s constraint networks [9,10], Waltz’s 
use of discrete relaxation to locate consistent interpretations of line drawings 
[17], together with a host of applications involving the A* algorithm [11,20]. 
More recently, the quest for effective search strategies has widened to include al- 
gorithms which offer improved global convergence properties. Examples include 
the use of simulated annealing [5,7], mean-field annealing [19], tabu-search [14] 
and most recently genetic search [3]. 

Despite stimulating a large number of application studies in the machine in- 
telligence literature, the convergence properties of these modern global optimisa- 
tion methods are generally less well understood than their classical counterparts. 
For instance, in the case of genetic search, although there has been considerable
effort directed at understanding the convergence for infinite populations of linear
chromosomes [12,13,15], little attention has been directed towards understanding
the performance of the algorithm for discrete entities organised on a network
structure. In a recent study we have investigated the use of genetic search for
graph matching [2]. Here we have used a hill-climbing genetic search procedure to
optimise a Bayesian measure for gauging relational consistency [18]. The main
conclusions of our study were threefold. First, the consistent labelling of graphs 
was only amenable to genetic search if a hill-climbing operator was incorporated. 
Second, the quality of the final solution was greatly improved if crossover (or 
genetic recombination) was conducted by exchanging connected subgraphs. Fi- 
nally, we found the optimisation process to be relatively insensitive to the choice 
of mutation rate. 

Unfortunately, our analysis of the empirical results has hitherto been ex- 
tremely limited and has been couched only in terms of a rather qualitative model 
of the pattern-space in which configurational optimisation is performed [2]. This
has meant that we have been unable either to predict the convergence behaviour 
or to account for the three interesting empirical properties listed above. The aim 
in this paper is to remedy this shortcoming by presenting a detailed analysis of 
algorithm behaviour. It is important to stress that although there have been sev- 
eral analyses of genetic search, these differ from the study described here in three 
important ways. First, we are concerned specifically with the graph-matching 
problem. This means that we present an analysis that is more pertinent to the 
consistent labelling problem where there is network organisation rather than a 
linear chromosome. Second, we pose our analysis in terms of discrete assignment 
variables rather than continuous ones. 

2 Relational Graphs 

Central to this paper is the aim of matching relational graphs represented in
terms of configurations of symbolic labels. We represent such a graph by $G =
(V, E)$, where $V$ is the symbolic label-set assigned to the set of nodes and $E$
is the set of edges between the nodes. Formally, we represent the matching of
the nodes in the data graph $G_1 = (V_1, E_1)$ against those in the model graph
$G_2 = (V_2, E_2)$ by the function $f : V_1 \to V_2$. In other words, the current state of
match is denoted by the set of Cartesian-pairs constituting the function $f$.

In order to describe local interactions between the nodes at a manageable
level, we will represent the graphs in terms of their clique structure. The clique
associated with the node indexed $j$ consists of those nodes that are connected
by an edge of the graph, i.e. $C_j = j \cup \{i \in V_1 \mid (i, j) \in E_1\}$. The labelling or
mapping of this clique onto the nodes of the graph $G_2$ is denoted by $\Gamma_j = \{f(i) \in
V_2, \forall i \in C_j\}$. Suppose that we have access to a set of patterns that represent
feasible relational mappings between the cliques of graph $G_1$ and those of graph
$G_2$. Typically, these relational mappings would be configurations of consistent
clique labellings which we want to recover from an initial inconsistent state of
the matched graph $G_1$. Assume that there are $Z_j$ relational mappings for the
clique $C_j$ which we denote by $\Lambda^\mu = \{\lambda_i^\mu \in V_2, \forall i \in C_j\}$ where $\mu \in \{1, 2, \ldots, Z_j\}$
is a pattern index. According to this notation $\lambda_i^\mu \in V_2$ is the match onto graph
$G_2$ assigned to the node $i \in V_1$ of graph $G_1$ by the $\mu$th relational mapping.
The complete set of legal relational mappings for the clique $C_j$ are stored in a
dictionary which we denote by $\Theta_j = \{\Lambda^\mu \mid \mu = 1, 2, \ldots, Z_j\}$.

The basic measure of consistency underpinning our method is the Hamming
distance between the matched supercliques of the data-graph and the consistent
relational mappings residing in the dictionary [18]. Using the ingredients outlined
above, the Hamming distance between the superclique matching configuration
$\Gamma_j$ and the dictionary item $\Lambda^\mu$ is $H(\Gamma_j, \Lambda^\mu) = \sum_{i \in C_j} (1 - \delta_{f(i), \lambda_i^\mu})$.

According to Wilson and Hancock [18], the probability of match is found
by assuming that matching errors occur with a uniform and memoryless error
probability $P_e$. As a result

$$P(\Gamma_j) = \frac{K_{C_j}}{Z_j} \sum_{\mu=1}^{Z_j} \exp\left[-k_e H(\Gamma_j, \Lambda^\mu)\right] \qquad (1)$$

where $K_{C_j} = (1 - P_e)^{|C_j|}$ and $k_e = \ln \frac{1 - P_e}{P_e}$. We use as our global measure of
consistency the average of the clique configurational probabilities, i.e. $P_G = \frac{1}{|V_1|} \sum_{j \in V_1} P(\Gamma_j)$.
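
As an illustration of how the consistency measure behaves, the sketch below evaluates the Hamming distance and a clique matching probability for a toy dictionary. It assumes the form $P(\Gamma_j) = \frac{(1-P_e)^{|C_j|}}{Z_j} \sum_\mu \exp[-k_e H(\Gamma_j, \Lambda^\mu)]$ with $k_e = \ln\frac{1-P_e}{P_e}$; that form, the function names, and the example data are assumptions made for this sketch, not a transcription of the authors' code.

```python
import math

def hamming(gamma, pattern):
    """Number of clique nodes whose match disagrees with the dictionary item."""
    return sum(1 for match, legal in zip(gamma, pattern) if match != legal)

def clique_probability(gamma, dictionary, p_e):
    """Assumed form of the superclique matching probability."""
    k_e = math.log((1 - p_e) / p_e)
    normaliser = (1 - p_e) ** len(gamma) / len(dictionary)
    return normaliser * sum(math.exp(-k_e * hamming(gamma, item)) for item in dictionary)

def global_consistency(cliques, dictionaries, p_e):
    """Average of the clique probabilities over all cliques."""
    probs = [clique_probability(g, d, p_e) for g, d in zip(cliques, dictionaries)]
    return sum(probs) / len(probs)

# Hypothetical clique of three nodes matched to model nodes 1, 2, 3.
gamma = (1, 2, 3)
exact = clique_probability(gamma, [(1, 2, 3)], p_e=0.1)
one_error = clique_probability(gamma, [(1, 2, 4)], p_e=0.1)
print(exact, one_error)
```

Under this assumed form a single-item dictionary with $h$ disagreements gives $(1 - P_e)^{|C_j| - h} P_e^{h}$, so each additional inconsistent match scales the probability by $P_e / (1 - P_e)$.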

Our recent paper [2], showed how a hill-climbing genetic search procedure 
could be used to optimise this global consistency measure. In essence the ap- 
proach relies on generating a population of random initial matching configura- 
tions. These undergo cross-over, mutation and selection to locate the optimal 
configuration of correspondence assignments between the two graphs. The main 
empirical findings concerning convergence were as follows: 

— The method was relatively insensitive to mutation rate. In fact, provided that 
the mutation probability did not exceed 0.600, then the number of iterations 
required for convergence was approximately constant. 

— The addition of the hill climbing step considerably reduced the number of 
iterations required for convergence. 

— Once the population size exceeded a critical value, then convergence rate 
was essentially independent of population size. 

— The number of iterations required for convergence was approximately poly- 
nomial in the number of graph- nodes. 

The aim in the remainder of this paper is to provide an analysis which supports 
these empirical findings. 

3 Distribution Analysis 

We formulate our investigation of graph-matching as a discrete time process with 
the states defined over the state-space of all possible correspondences between 
a pair of graphs. Our analysis of the population is a statistical one, in which we 
assume that the population is sufficiently large that we can invoke the central 
limit theorem. For this reason we direct our attention to the modelling of the 
probability density function for the distribution of solution-vectors.

3.1 Formal Ingredients of the Model

Suppose that the index-set for the population of solutions is $\mathcal{P}$ and that $\alpha$ is
the solution index. For the problem of graph matching, each solution vector
$f_\alpha : V_1 \to V_2$ belonging to the population represents the labelling of the nodes
of a data-graph $V_1$ with the nodes of a model graph $V_2$. In order to simplify
the analysis, we will assume that the pair of graphs have an identical number of
nodes, i.e. $|V| = |V_1| = |V_2|$.

To commence our modelling of the distribution of solution-vectors, we focus
our attention on the fraction of mappings that are in agreement with known
ground truth. If the configuration of ground-truth correspondence matches is
denoted by $\tilde{f}$, then the fraction of correctly assigned matches for the
solution-vector indexed $\alpha$ is equal to

$$F_\alpha^{(n)} = \frac{1}{|V|} \sum_{i \in V_1} \delta_{f_\alpha^{(n)}(i),\, \tilde{f}(i)} \qquad (2)$$

where $\alpha$ is a population index of the solution-vector, $n$ is the iteration number
and $\delta$ is the Kronecker delta function. A solution-vector in which each of
the matches is correct would have $F_\alpha^{(n)} = 1$. By contrast a solution vector in
which none of the correspondence matches are correct would have $F_\alpha^{(n)} = 0$. In
order to analyse how the genetic graph-matching process performs, we wish to
evaluate the distribution of $F_\alpha^{(n)}$ over the entire population of candidate solution
vectors. At iteration $n$, we denote the distribution of fractional matching-error
by $P^{(n)}(F = \gamma)$.

The overall goal of our analysis is to model how the distribution of the fraction
of correct matches evolves with iteration number. For reasons of tractability, we
largely confine our attention to understanding how the mean fraction of correct
matches changes with iteration number. The quantity of interest is

$$F^{(n)} = \frac{1}{|\mathcal{P}|} \sum_{\alpha \in \mathcal{P}} F_\alpha^{(n)} \qquad (3)$$


We commence our analysis by assuming that the number of correct matches
in the initial population follows a binomial distribution. This is justifiable
provided that the initial matching errors are governed by a Bernoulli process. If
the initial population is chosen in a random fashion, then at the outset the
expected fraction of correct matches is given by $F^{(0)} = \frac{1}{|V|}$. Under the binomial
assumption, the number of correct matches has mean $|V| F^{(0)}$ and standard
deviation $\sqrt{|V| F^{(0)} (1 - F^{(0)})}$. As a result the initial number of correct matches $r$
is distributed in the following manner

$$P(r) = \frac{|V|!}{r!\,(|V| - r)!} \left(F^{(0)}\right)^{r} \left(1 - F^{(0)}\right)^{|V| - r} \qquad (4)$$

Since we are interested in the fraction of correct matches, we turn our attention
to the distribution of the random variable $\gamma = \frac{r}{|V|}$. By appealing to the central
limit theorem under the assumption that the graphs have large numbers of nodes,
we can replace the binomial distribution of the number of correct matches by
a Gaussian distribution of the fraction of correct matches. In practice we deal
with graphs whose size exceeds 40 nodes, and so this approximation will be a
faithful one. The probability distribution for the fraction of correct solutions in
the population is therefore

$$P^{(0)}(F = \gamma) = \frac{1}{\sqrt{2\pi |V| F^{(0)} (1 - F^{(0)})}} \exp\left[ -\frac{\left(|V| \gamma - |V| F^{(0)}\right)^2}{2 |V| F^{(0)} (1 - F^{(0)})} \right] \qquad (5)$$

Because the distribution is Gaussian, the mode is located at the position $\gamma = F^{(0)}$.



4 Genetic Operators 

In this section we will investigate the role that the three traditional genetic oper- 
ators, i.e. mutation, crossover and selection, play in the evolution of the average 
fraction of correct solutions in the genetic population. We will supplement this 
analysis with a discussion of the hill-climbing process. At this stage, our inter- 
est lies not with the prediction of the collective behavior of the operators, but 
with the effect that each one has in isolation upon the population. Collective 
behaviour is the topic of Section 5. 



4.1 The Mutation Operator 

The goal of the mutation operator is to increase population diversity by perform- 
ing a stochastic state swapping process on the individual node matches residing 
on each of the different solutions that constitute the current population. This 
process proceeds independently for both the individual nodes and the individ- 
ual solutions. This is in contrast with the crossover which serves to exchange 
information between pairs of solutions in order to form new individuals that are 
admitted to the population on the basis of their fitness. The uniform probability
of any match undergoing a random state swapping process is $P_m$. For each
individual solution vector, there are three possible transitions that can occur in
the state of match. First, an individual mutation could increase the number of
correct matches by one unit; in this case the increase in the fraction of correct
matches is $P_m (1 - F_\alpha^{(n)}) \frac{1}{|V|}$. The second possible outcome is a reduction in the
number of correctly assigned matches by one unit; in this case the decrease in the
fraction of correct matches is $P_m F_\alpha^{(n)} \frac{|V| - 1}{|V|}$. Finally, the mutation could leave the
number of correct correspondences unchanged; in this case the fraction of correct
matches remains unchanged at the value $F_\alpha^{(n)}$. For moderate mutation rates, the
most likely change to the fraction of matches is due to the second transition.
This corresponds to a disruption of the set of correctly assigned matches.

We are interested in the effect that the mutation operator has upon a
solution-vector in which the fraction of correct matches is $F_\alpha^{(n)}$ at iteration $n$. In
particular, we would like to compute the average value of the fraction of correct
matches at iteration $n + 1$. Based on the three assignment transitions outlined
above, the new average fraction of correct matches is

$$F_\alpha^{(n+1)} = F_\alpha^{(n)} + P_m \left(1 - F_\alpha^{(n)}\right) \frac{1}{|V|} - P_m F_\alpha^{(n)} \frac{|V| - 1}{|V|} \qquad (6)$$

After some straightforward algebra, we can re-write this recursion formula in
terms of the fraction of matches correct at the outset, i.e. $F_\alpha^{(0)}$. As a result, the
average fraction of correct matches at iteration $n$ is

$$F_\alpha^{(n)} = \frac{1}{|V|} + \left(F_\alpha^{(0)} - \frac{1}{|V|}\right) \exp(-k_m n) \qquad (7)$$

where $k_m = \ln \frac{1}{1 - P_m}$. There are a number of interesting features of this formula
that deserve further comment. First, the equation represents an exponential
decay that tends towards a minimum value of $\frac{1}{|V|}$, i.e. the probability of randomly
assigning a correct match. The rate of decay is determined by the logarithm of
the probability that a mutation operation does not take place, i.e. $1 - P_m$. In
qualitative terms, the mutation process represents an exponential drift towards
highly unfit solutions. The rate of drift is controlled by two factors. The first of
these is the mutation rate $P_m$. As the mutation probability increases, then so the
disruptive effect of the operator becomes more pronounced. The second factor is
the initial fraction of correct matches that were present prior to mutation. As 
this initial fraction of correct matches increases, then so does the disruptive effect 
of the mutation operator. The effect of this second drift process is to impose a 
higher rate of disruption on solutions in the population that are approaching a 
consistent state. Poor or highly inconsistent solutions, on the other hand, are 
not significantly affected. This latter drift effect can be viewed as a natural 
mechanism for escaping from local optima that may be encountered when the 
global solution is approached in a complex fitness landscape. 

Figure 1 illustrates how the peak of the population distribution drifts towards 
the origin in the fashion predicted. Moreover, the width of the distribution be- 
comes narrower as it approaches the origin. In order to quantify this process 
we plot the most probable fraction of correct matches in the population against 
iteration number. This plot is shown in Figure 2. We note that there is a good 
agreement between our prediction of exponential decay and what is observed 
experimentally. 
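
The agreement between the mutation recursion and its closed-form solution can be checked with a few lines of code. This is a minimal sketch with hypothetical parameter values: it iterates the one-step update (a gain of $P_m (1 - F)/|V|$ and a loss of $P_m F (|V| - 1)/|V|$) and compares the result with $\frac{1}{|V|} + (F^{(0)} - \frac{1}{|V|}) \exp(-k_m n)$, where $k_m = \ln\frac{1}{1 - P_m}$.

```python
import math

def mutate_step(F, p_m, n_nodes):
    """One application of the mutation recursion to the fraction of correct matches."""
    gain = p_m * (1 - F) / n_nodes              # wrong match becomes correct
    loss = p_m * F * (n_nodes - 1) / n_nodes    # correct match is disrupted
    return F + gain - loss

def closed_form(F0, p_m, n_nodes, n):
    """Exponential decay towards the random-assignment level 1/|V|."""
    k_m = math.log(1 / (1 - p_m))
    return 1 / n_nodes + (F0 - 1 / n_nodes) * math.exp(-k_m * n)

# Hypothetical settings: 40-node graphs, 5% mutation rate, 90% correct at the outset.
F0, p_m, n_nodes = 0.9, 0.05, 40
F = F0
trajectory = []
for n in range(1, 51):
    F = mutate_step(F, p_m, n_nodes)
    trajectory.append((F, closed_form(F0, p_m, n_nodes, n)))
print(trajectory[0], trajectory[-1])
```

The two columns coincide up to floating-point error because the recursion simplifies to $F^{(n+1)} = (1 - P_m) F^{(n)} + P_m / |V|$, whose fixed point is $1/|V|$.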

4.2 An Analysis of the Selection Operator 

In contrast with the mutation operator which is a uniform random process, 
selection draws upon the fitness function to determine the probability that the 
different solution vectors survive into the next population generation. Because of 
this task-specific nature of the fitness function, it is not possible to undertake a
general analysis of the selection process. Moreover, in the case of graph-matching
the compound exponential structure of our fitness measure further complicates
the analysis. To overcome this problem, we present an approximation to our
Bayesian consistency measure which allows us to relate the survival probability
to the fraction of correct matches. This approximate expression for the survival
probability turns out to be polynomial in the fraction of correct matches.

Fig. 1. A numerical simulation of the distribution of correct solutions in the
population using only the mutation operator.

Fig. 2. The comparison of the analytic predictions of the mutation operator with
the simulation run results.

We commence by writing the fitness using the expression for the super-clique
matching probability given in equation (1). To make the role of error-probability
more explicit, we re-write the matching probability in terms of Kronecker delta
functions that express the compatibility between the current matching
assignments and the consistent matches demanded by the configuration residing in the
dictionary. As a result

$$P(\Gamma_j) = \frac{1}{Z_j} \sum_{\Lambda^\mu \in \Theta_j} \prod_{i \in C_j} (1 - P_e) \exp\left[-k_e \left(1 - \delta_{f(i), \lambda_i^\mu}\right)\right] \qquad (8)$$

Our aim is to compute the average value of the global consistency measure,
$P_G$. Because the consistency function averages the matching probability $P(\Gamma_j)$,
the expected-value of the global probability is equal to

We now note that the expected value of the exponential function under the
product can be re-expressed in terms of the assignment probabilities in the following
manner

$$E\left[(1 - P_e) \exp\left[-k_e \left(1 - \delta_{f(i), \lambda_i^\mu}\right)\right]\right] = P_e\, P(f(i) \ne \lambda_i^\mu) + (1 - P_e)\, P(f(i) = \lambda_i^\mu) \qquad (10)$$

As a result, the expected-value of the global matching probability, i.e. the
probability of survival, is equal to

$$P_G = \frac{1}{Z_j} \sum_{\Lambda^\mu \in \Theta_j} \prod_{i \in C_j} \left[ P_e\, P(f(i) \ne \lambda_i^\mu) + (1 - P_e)\, P(f(i) = \lambda_i^\mu) \right] \qquad (11)$$


Unfortunately this expression still contains reference to the dictionary of 
structure preserving mappings. In order to further simplify matters, we observe 
that when the configuration of assigned matches becomes consistent then, pro- 
vided the error probability Pe is small, we would expect the sum of exponentials 
appearing in equation (1) to be dominated by the single dictionary item that is 
fully congruent with the ground-truth match. The remaining dictionary items 
make a negligible contribution. Suppose that fi is the index of the correctly 
matching dictionary item, then we can write 



exp[-keH{Pj,A^)]:^ exp[-keH{Pj,Af^)] ( 12 ) 

A^U0j0A^ 

We can now approximate the super-clique matching probability by considering
only the dominant dictionary item. This allows us to neglect the summation
over dictionary items. Finally, we note that the average value of the
probability of correspondence match, i.e. $P(f(i) = \lambda_i^{\hat{\mu}})$, is
simply equal to the fraction of correct matches $F^{(n)}$. By assuming that all
super-cliques are of approximately the same average cardinality, denoted by
$|C|$, we can approximate the global probability of match in the following
manner:

$$P_G \simeq \left[ P_e \left(1 - F^{(n)}\right) + (1 - P_e)\, F^{(n)} \right]^{|C|} \tag{13}$$



In other words, our measure of relational consistency is polynomial in the
fraction of correct matches. Moreover, the order of the polynomial is equal to
the average node connectivity $|C|$. As the average neighbourhood size or node
connectivity in the graphs increases, the discriminating power of the cost
function becomes more pronounced.
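
As a rough numerical illustration of this polynomial behaviour, the following
sketch evaluates a consistency measure of the form
$[P_e(1 - F) + (1 - P_e)F]^{|C|}$ for a good and a poor match and reports their
ratio. The function names, the value $P_e = 0.1$ and the two reference
fractions are hypothetical choices made for illustration; the polynomial form
itself is an assumption based on the description above.

```python
def approx_consistency(frac_correct, p_err, clique_size):
    """Assumed polynomial form: (P_e (1 - F) + (1 - P_e) F) ** |C|."""
    per_node = p_err * (1.0 - frac_correct) + (1.0 - p_err) * frac_correct
    return per_node ** clique_size


def discrimination(p_err, clique_size, good=0.9, poor=0.5):
    """Ratio of the measure for a good match to that for a poor match."""
    high = approx_consistency(good, p_err, clique_size)
    low = approx_consistency(poor, p_err, clique_size)
    return high / low


if __name__ == "__main__":
    # Larger super-cliques separate good and poor matches more strongly.
    for size in (2, 4, 8):
        print(size, discrimination(0.1, size))
```

Under these assumptions the ratio grows geometrically with the super-clique
size, which is one way of reading the statement that the discriminating power
increases with node connectivity.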

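At iteration $n$ of the algorithm, we can use the selection probability to
compute the distribution of the fraction of correct matches using the
relationship

$$P^{(n+1)}(\gamma) \propto P^{(n)}(\gamma)\, P_G(\gamma), \quad \text{so that} \quad P^{(n)}(\gamma) \propto P^{(0)}(\gamma)\, [P_G(\gamma)]^{n} \tag{14}$$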

Substituting for the approximate initial distribution given in Equation 5
together with our approximation to the cost function from Equation 13, we find

$$P^{(n)}(\gamma) \propto P^{(0)}(\gamma) \times \left( P_e (1 - \gamma) + (1 - P_e)\, \gamma \right)^{|C| n} \tag{15}$$

Here $P^{(0)}(\gamma)$ stands for the Gaussian initial distribution of Equation 5.



The required distribution is simply a Gaussian distribution that is modulated
by a polynomial of order $|C|n$. This demonstrates that the average fraction of
correct matches in the population will tend to increase as the value of $n$
increases. In other words, the iteration process improves the fraction of
correct matches. By confining our attention to the solutions that occur most
frequently in the population, we can track the iteration dependence of the mode
or peak, $F_{\max}^{(n)}$, of the distribution of correct matches. To locate the
most frequently occurring solution in the population, we proceed as follows.
First, we evaluate the derivative of the distribution function in Equation 15
with respect to the fraction of correct mappings, i.e. $\gamma$. Next, we set
the derivative equal to zero. By solving the resulting saddle-point equation for
$\gamma$, and after rejecting the non-physical values of $\gamma$ that fall
outside the interval $[0, 1]$, we find that the maximum value $F_{\max}^{(n)}$ is
located at the position



max
where ki = 2 f(o)\T^f(Q)) = 1 - 2 Pg.

With this model of the iteration dependence of $F_{\max}^{(n)}$ under the
selection operator to hand, we are in a position to compute the number of
iterations required for algorithm convergence. Our convergence condition is
that the modal fraction of correct matches is unity. We identify the value of
the iteration index $n$ that satisfies this condition by setting
$F_{\max}^{(n)} = 1$ in Equation 16. Furthermore, we assume that the initial
population is randomly chosen, which fixes the initial value $F_{\max}^{(0)}$.
Solving for $n$, we find the number of iterations required for convergence to
be equal to

Fig. 3. A numerical simulation of the distribution of correct solutions in a genetic
population using only the selection operator.



In other words, commencing from a simple model of the selection process that 
uses a number of domain specific assumptions concerning our Bayesian consis- 
tency measure, we have shown that we would expect the number of iterations 
required for convergence to be polynomial with respect to the number of nodes 
in the graphs under match. 

In order to provide some justification for our modelling of the population 
mode, we have investigated how the selection operator modifies the distribution 
of correct correspondences in the genetic population. Figure 3 shows the distri- 
bution as a function of the iteration number. The main point to note is that 
the width of the distribution remains narrow as the iterations proceed. This is 
because only the selection operator is used. It is important to stress that there 
is no diversification process at play. 
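
The drift of the population mode under selection alone can be sketched
numerically as follows. The sketch assumes that the distribution after $n$
iterations is an initial Gaussian multiplied by the $n$-th power of a
consistency polynomial of order $|C|$; the Gaussian mean and width, the error
probability and the clique size are all hypothetical values, and the function
name is illustrative rather than taken from the original study.

```python
import numpy as np


def selection_only_modes(f0=0.2, sigma=0.1, p_err=0.1, clique_size=4, n_iters=6):
    """Track the mode of gaussian(gamma) * poly(gamma) ** (|C| n) on a grid.

    All default values are hypothetical. The Gaussian stands in for the
    initial distribution; poly is the assumed consistency polynomial.
    """
    gamma = np.linspace(0.0, 1.0, 2001)
    log_gauss = -0.5 * ((gamma - f0) / sigma) ** 2
    log_poly = np.log(p_err * (1.0 - gamma) + (1.0 - p_err) * gamma)
    modes = []
    for n in range(n_iters + 1):
        # Unnormalised log-density; only the location of its peak is needed.
        log_density = log_gauss + clique_size * n * log_poly
        modes.append(float(gamma[np.argmax(log_density)]))
    return modes


if __name__ == "__main__":
    for n, mode in enumerate(selection_only_modes()):
        print(n, mode)
```

Because both factors are log-concave, the product stays unimodal and its peak
can only move towards larger fractions of correct matches as $n$ grows.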



4.3 An Analysis of the Hill-climbing Operator 

Since the hill-climbing operator is only used to make local changes that increase 
Pg, it is clear that it can only improve the quality of the match. In this section 
of our analysis, we aim to determine to what extent the hill-climbing operator 
affects the overall convergence rate of our algorithm.

Modelling the behavior of the gradient ascent step of the algorithm is clearly
a difficult problem, since it is highly dependent on the local structure of the
global landscape of the fitness measure. One way of simplifying the analysis is
to adopt a semi-empirical approach. Here we aim to simulate the gradient ascent
process by Monte-Carlo methods and extract a parameterisation of the iteration
dependence of the required distribution parameters. Our starting point is to
generate 1000 random graphs. Commencing from a controlled fraction of initially
correct matches, we perform gradient ascent until the configuration of matches
stabilises and no more updates can be made. We plot the final fraction of
correct matches against the fraction initially correct in Figure 4. The best fit
to the data gives the following iteration dependence

Fig. 4. Empirical results demonstrating how we expect gradient ascent to
perform. The dotted curve represents the best fit that was found.



(18) 

This result relates the fraction of correct matches at iterations n and n + 1 
resulting from the application of the hill-climbing operator. 

By expanding the recursion in iteration number, we can obtain the dependence
on the initial fraction of correct matches. At iteration n, the fraction of
correct matches is given by 



= (19) 

This assumes we can immediately restart the gradient ascent with the so- 
lutions from the previous iteration. Since at each iteration gradient ascent is 
performed until a local maximum is reached, the convergence rate calculated 
in this section is mainly of theoretical value. Also, when comparing the conver- 
gence rate of selection and hill-climbing, the number of function evaluations per 
iteration clearly should be taken into account as well. 

We can use the empirical iteration dependence of the expected fraction of
correct solutions to make a number of predictions about the convergence rate of
population-based hill-climbing. We commence by assuming that the initial set of
matches is selected in a random manner, as in the analysis of the selection
operator. Our condition for convergence is that fewer than one of the matches
per solution is in error. By substituting this condition into Equation 19 and
solving for $n$, we find



n = 0.36 



ln(|E|) 



(20) 



The number of iterations required for convergence increases slowly with 
graph-size. For modest numbers of nodes the increase is approximately linear. 
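
The slow growth with graph size can be explored with a small sketch. It assumes
a hill-climbing recursion of the form $F \leftarrow 1 - (1 - F)^{2.8}$, which is
inferred here from the exponent appearing in the hill-climbing shift of Section
5.2, and a random initial fraction of $1/|V|$; both are assumptions, as is the
function name. The loop counts iterations until fewer than one match per
solution is expected to be in error.

```python
def hill_climb_iterations(num_nodes, exponent=2.8, max_iters=100):
    """Count iterations of F <- 1 - (1 - F) ** exponent, starting from 1/|V|.

    Stops once the expected number of wrong matches, |V| (1 - F), is below one.
    The recursion and the starting value are assumptions, not fitted results.
    """
    frac = 1.0 / num_nodes
    for n in range(max_iters + 1):
        if num_nodes * (1.0 - frac) < 1.0:
            return n
        frac = 1.0 - (1.0 - frac) ** exponent
    raise RuntimeError("recursion did not converge within max_iters")


if __name__ == "__main__":
    for size in (20, 50, 100, 1000):
        print(size, hill_climb_iterations(size))
```

Under this assumed recursion the error fraction is raised to the power 2.8 at
every step, so the iteration count grows only very slowly as the number of
nodes increases.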

It is interesting to contrast the dependence on graph size with that for the
selection operator. Whereas hill-climbing has a slow dependence on graph size,
in the case of selection there is a more rapid polynomial dependence. As a
figure of merit, for a graph of 50 nodes, the number of iterations required by
selection is a factor of 10 larger than that required by hill-climbing.

5 An Analysis of the Combined Operators 

In this section we provide an analysis of the combined effect of the mutation, se- 
lection and hill-climbing operators. Since the effect of crossover is simply to blur 
the fitness distribution, we omit it from our analysis. We provide convergence 
analysis for both the standard genetic algorithm and a hill-climbing variant. 



5.1 Standard Genetic Search 



Our aim is to extend the analysis of the individual operators presented in
Section 4 by deriving a sufficient condition that ensures a monotonic increase
of the expected fraction of correct solutions when composite operators are
applied. To embark on this study, we must first consider the order in which the
different genetic operators are applied. As the population of candidate
solutions enters the new iteration $n + 1$ from the preceding iteration $n$, we
first perform the crossover operation. As we have discussed earlier, this
process results in a post-crossover population that is distributed according to
a Gaussian distribution. As a result, the mode of the distribution is located
where the fraction of correct solutions is equal to $F^{(n)}$, while we denote
the standard deviation of the distribution by $\sigma$. The mutation operator is
applied after the crossover process. The main effect is to shift the mode of
the Gaussian distribution to a lower fitness value by an amount



$$\Delta F^{\mathrm{mutation}} = \frac{\cdots}{|V|} \tag{21}$$



As expected, there is a decrease in the expected fraction of correct solutions.
Immediately following the application of the mutation operator, we do not know
the exact distribution of the fraction of correct matches. However, as
demonstrated earlier, we know that for a large number of mutation operations,
the distribution is binomial, which in turn can be approximated well by a
Gaussian for large $|V|$. As a result, the required distribution can be
approximated in a Gaussian manner. The mode of the distribution is located at
the position

$$F^{\mathrm{mutation}} = F^{(n)} - \Delta F^{\mathrm{mutation}} \tag{22}$$

If the mutation probability $P_m$ is relatively small, as is usually the case,
then after a single mutation operation we can assume that the standard
deviation of the Gaussian remains unchanged.

In order to determine how the peak or mode of this distribution is shifted
by the selection operator, we recall Equation 16. Our interest is now in the
change in the fraction of correct matches that the peak of the distribution
undergoes under the combined selection and mutation operators. This quantity is
equal to the peak value offset by the shift due to mutation, i.e.

$$\Delta F^{\mathrm{selection}} = F_{\max}^{\mathrm{selection}} - F^{\mathrm{mutation}} \tag{23}$$

It is important to note that the distribution used as input to the selection 
operator is the result of the sequential application of the crossover and mutation 
processes. Computing distribution shift after selection is straightforward, but al- 
gebraically tedious. For this reason we will not reproduce the details here. Given 
that we now have a prediction of how we expect the peak of the population dis- 
tribution to evolve under the processes of crossover, mutation and subsequent 
selection, we are in a position to construct a condition for monotonic conver- 
gence. Clearly, for the population to converge, the downward shift (i.e. fitness 
reduction) due to the mutation operator must be smaller than the upward shift 
(i.e. fitness increase) resulting from selection. In order to investigate this balance 
of operators, we consider the break-even point between mutation and selection 
which occurs where 

$$\Delta F^{\mathrm{selection}} = \Delta F^{\mathrm{mutation}} \tag{24}$$

Substituting from Equations 21 and 23, and solving for $P_m$, the break-even
condition is satisfied when

(25)

It is important to note that this condition on the mutation probability is
very similar to that derived by Qi and Palmieri [12,13]. In fact, the maximum
mutation rate is proportional to the ratio of the variance of the fraction of
correct mappings in the population to the current expected fraction of correct
matches. Moreover, the limiting value of the mutation rate is proportional to
the total number of edges in the graphs, i.e. |E|-|C|. Finally, we note that as
the fraction of correct matches increases, the mutation rate must be reduced in
order to ensure convergence. It is important to emphasize that we have confined
our attention to deriving the condition for monotonic convergence of the
expected fitness value. This condition does not guarantee that the search
procedure will converge to the global optimum. Neither does it make any attempt
to capture the possibility of premature convergence to a local optimum.
Moreover, the analysis assumes that all members of the population participate
in crossover at each iteration, and that enough crossover is performed to
arrive at an approximately Gaussian distribution. This may not be realistic
since fitness distributions are typically observed to be quite asymmetric.



5.2 Hybrid Hill Climbing 

Having derived the monotonic convergence condition for the combined effect
of the three standard genetic operators, we will now turn our attention to the
hybrid hill-climbing algorithm used in our empirical study of graph-matching
[2]. As before, we compute the change in the fraction of correct matches that
we would expect to result from the additional application of the hill-climbing
operator. Since this step immediately follows mutation, the population shift is
given by



$$\Delta F^{\mathrm{hillclimb}} = 1 - \left(1 - F^{\mathrm{mutation}}\right)^{2.8} - F^{\mathrm{mutation}} \tag{26}$$



For completeness, our analysis should next focus on the effect of the selection
operator. However, as we demonstrated in Section 4.3, the rate of convergence
for the selection operator is significantly slower than that of the
hill-climbing operator. This observation suggests that we can neglect the
effects of selection when investigating the hybrid hill-climbing algorithm.

In order to identify the monotonic convergence criterion for the hybrid hill- 
climbing algorithm, we focus on the interplay between opposing population shifts 
caused by mutation and hill climbing. This analysis is entirely analogous to 
the case presented in the previous subsection where we consider the interplay 
between mutation and selection for the standard genetic algorithm. In the case 
of the hybrid hill climbing algorithm, the break-even occurs when 



$$\Delta F^{\mathrm{hillclimb}} = \Delta F^{\mathrm{mutation}} \tag{27}$$



When the size of the graphs is large, i.e. $|V| \gg 1$, the convergence
condition on the mutation probability is



$$P_m < 1 - \frac{1}{F^{(n)}} \left[ 1 - \left(1 - F^{(n)}\right)^{1/2.8} \right] \tag{28}$$



This limiting mutation rate is plotted in Figure 5 as a function of $F^{(n)}$.
In practice, we must select the operating value of $P_m$ to fall within the
envelope defined by the curve in Figure 5. As the fraction of correct matches
approaches unity (i.e. the algorithm is close to convergence), the mutation
rate should be annealed towards zero. More interestingly, we can use the
convergence condition to determine the largest value of the mutation rate for
which convergence can be obtained. By taking the limit as the fraction of
correct matches approaches zero, we find that
$\lim_{F^{(n)} \to 0} P_m = 0.6430$. This agrees well with the
empirical findings reported in our previous work [2].
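
The limiting behaviour quoted above can be checked with a short sketch. It
rests on two assumptions that are not spelled out in the text as it stands:
that for large graphs mutation moves the mode from $F$ to $(1 - P_m)F$, and
that hill-climbing maps a fraction $F$ to $1 - (1 - F)^{2.8}$. Setting the
hill-climbed value equal to $F$ gives a break-even mutation probability; the
function name is hypothetical.

```python
def max_mutation_rate(frac_correct, exponent=2.8):
    """Break-even mutation probability under the assumptions stated above.

    Solves 1 - (1 - (1 - P_m) F) ** exponent = F for P_m.
    """
    if not 0.0 < frac_correct <= 1.0:
        raise ValueError("frac_correct must lie in (0, 1]")
    residual = 1.0 - (1.0 - frac_correct) ** (1.0 / exponent)
    return 1.0 - residual / frac_correct


if __name__ == "__main__":
    for frac in (1e-6, 0.25, 0.5, 0.75, 1.0):
        print(frac, max_mutation_rate(frac))
```

With these assumptions the bound tends to $1 - 1/2.8 \approx 0.643$ as the
fraction of correct matches goes to zero and falls to zero as it approaches
unity, in line with the limiting value and the annealing behaviour described
in the text.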

Before concluding, we return to the question of population size. The aim
here is to use our Monte-Carlo study to assess how the theoretical predictions,
made under the central limit assumption for large population size, degrade as
the population size becomes relatively small. In Figure 6 we plot $F^{(n)}$ as a
function of iteration number for increasing population sizes. It is clear that
beyond a population size of about 50 solutions, the convergence curves for the
different population sizes become increasingly similar.



Fig. 5. The maximum mutation rate that may be used to ensure monotonic
convergence of a hybrid Genetic Hill-climbing optimisation scheme.

Fig. 6. The expected value of $F^{(n)}$ under the traditional genetic operators
for various population sizes.



6 Conclusions

In this paper our aim has been to understand the role that the different genetic
operators play in the convergence of a genetic algorithm. The mutation operator
was found to produce an exponential drift of the population distribution towards
incorrect mappings. The drift-rate was found to depend on both the mutation
rate and the current fraction of correct correspondences. In other words, there
is greater disruption when the population is dominated by a single consistent
solution. In the case when the population contains a large number of dissimilar
yet poor solutions, there is less disruptive drift. By contrast with the other
operators, the role of the crossover operator is to exchange information via
recombination. The net effect is to blur the distribution of the fraction of
correct solutions in a Gaussian manner. In other words, the mean fraction of
correct solutions remains stationary, while the corresponding variance
increases.

Based on this operator-by-operator analysis, we have found conditions for
convergence. We have obtained two interesting results. First, the convergence
rate for the standard genetic algorithm was found to be limited by the mutation
probability; the limiting value is proportional to the total number of graph
edges. The second result was to show that in the case of the hill-climbing
genetic algorithm, the corresponding limiting mutation probability is
independent of the structure of the graphs. This result accords well with our
previous empirical findings.



Abstract. We construct probabilistic generative models for the non-rigid
matching of point-sets. Our formulation is explicitly Platonist. Beginning
with a Platonist super point-set, we derive real-world point-sets through the
application of four operations: i) spline-based warping, ii) addition of noise,
iii) point removal and iv) amnesia regarding the point-to-point correspondences
between the real-world point-sets and the Platonist source. Given this
generative model, we are able to derive new non-quadratic distance measures
w.r.t. the “forgotten” correspondences by a) eliminating the spline parameters
from the generative model and by b) integrating out the Platonist super
point-set. The result is a new non-quadratic distance measure which has the
interpretation of weighted graph matching. The graphs are related in a
straightforward manner to the spline kernel used for non-rigid warping.
Experimentally, we show that the new distance measure outperforms the
conventional quadratic assignment distance measure when both distances use the
same weighted graphs derived from the spline kernel.



1 Introduction 

The need for non-rigid image matching arises in many domains within the field 
of computer vision. Some form of non-rigid matching is required to match ob- 
jects that have undergone complex deformations. The extent of the deformation 
required to achieve non-rigid matching is an important quantity as it provides a 
convenient measure of distance between the two objects. 

Non-rigid matching methods can be broadly divided into two categories:
intensity-based and feature-based. Intensity-based methods attempt to calculate 
the optical flow between the two images. Usually, these methods require fairly 
strong brightness constancy assumptions between the two images. Feature-based 
methods attempt to match two sets of sparse features that have been extracted 
from the underlying image intensities. Usually, these methods require both object 
and deformation models in order to constrain the set of allowed matches and 
deformations. 

Object models are typically constructed using a hierarchy of features: points,
lines, curves, surfaces, etc. If matching is performed using generic, unlabeled
point features, then the correspondence problem is acute. On the other hand,
if high-level feature representations are used, the correspondence problem is
alleviated but the matching is not likely to be robust against missing features.
In addition, the constraints on the deformation model become more complex
when high-level features are used.

In this paper, we are mainly concerned with deriving new distance measures 
for non-rigid matching of unlabeled point features. The new distance measure 
is a function of the unknown point-to-point correspondences and can handle 
outliers as well. Since we mostly focus on the new distance measure, at this 
point we are not presenting an algorithm to minimize this distance. 

The formulation of our problem is explicitly Platonist. We begin by assuming 
a Platonist super point-set of unlabeled features. By using a probabilistic thin- 
plate spline warping model, we are able to generate real-world point feature sets. 
Outliers are explicitly modeled by forcing each real-world warped point-set to be 
a strict subset of the Platonist super point-set. The final step in this generative 
model is the loss of information of the point-to-point correspondences between 
the real-world point-set and the Platonist super point-set. 

After exploiting a Platonist analogy in formulating this model, we then derive 
a new distance measure between the real-world point-sets. First, we eliminate 
the thin-plate spline warpings from the model by setting these parameters to 
their maximum a posteriori estimates. Then, in typical Bayesian fashion, we 
integrate out the hidden Platonist super point-set. The result is a new non- 
quadratic distance measure between all of the real-world point-sets defined solely 
in terms of the unknown correspondences. 

Having derived the new non-quadratic distance measure, we present compar- 
isons with the more traditional quadratic assignment distance measures. As a 
by-product of our derivation, we are able to show that the new distance mea- 
sure is closely related to a weighted graph matching distance measure with the 
“graphs” determined by the thin-plate spline kernel. Both distance measures 
(quadratic and non-quadratic) use the same graphs derived from the spline ker- 
nels. Finally, we show that our new distance measure significantly outperforms 
the quadratic distance measure indicating a payoff resulting from our principled 
derivation. 

2 Review 

The various approaches to non-rigid image matching can be broadly grouped into
two categories: intensity-based and feature-based. Intensity-based methods
begin by assuming some form of brightness constancy reminiscent of optical flow
methods [2]. Most methods in this class attempt to minimize an energy function
that consists of two terms. The first term simply sums over the square of the
intensity differences between the two images at each pixel. The second term is
an elastic matching term which is typically derived from considerations of
smoothness of the displacement field [7]. A free parameter is used to trade off
between these two terms. The principal difficulty with this entire class of
methods is that the brightness constancy assumption is frequently violated.
Recently, there has been considerable interest in using entropy and mutual
information-based intensity distance measures [30,16] to overcome these
limitations. A second problem with these methods is related to the lack of
explicit object modeling. Since no attempt is made to construct object models,
these methods cannot enforce correspondence constraints on structures that are
a priori known to match. Recently [5], there has been some effort to overcome
this limitation by including region segmentation information into the
computation of optical flow. However, it is fair to say that at the present
time, intensity-based image matching methods have yet to fully solve the
aforementioned (two) problems by incorporating segmentation information into
mutual information-based estimation of displacement fields (flow).

Feature-based image matching methods form the second class of methods.
In contrast to the optical flow-based intensity matching methods, feature-based
methods are more varied. One way of dividing the space of feature-based methods
is along the lines of sparse versus dense features. Labeled landmark points
are the most popular kind of sparse features since non-rigid matching of
landmarks does not require a solution to the point-to-point correspondence
problem. For instance in [6], thin-plate splines (TPS) [31] are used to
characterize the deformation of landmarks. Basically, the non-rigid matching
problem is solved by minimizing the bending energy of a thin-plate spline while
forcing corresponding landmarks extracted from the two images to perfectly
match. Landmark positioning “jitter” can be accounted for in this model by
allowing a trade-off between the landmark position least-squares matching
energy term and the spline bending energy term. This is analogous to the
“vanilla” optical flow image matching method mentioned above. The major
drawback of this method is the over-reliance on a few landmarks. Extracting
labeled and corresponding landmarks from the two images is a difficult problem.
Moreover, the method is quite sensitive to the number and choice of landmarks.

Dense feature-based matching methods run the gamut of matching points, 
lines, curves, surfaces and even volumes [4]. These methods usually begin with 
an object parameterization. Then, the allowable ways in which the object can
deform are specified [18,23]. The methods that fall into this class differ in object
parameterizations and in the specification of the kinds of allowed deformations. 
In most cases, curves and/or surfaces are first fitted to features extracted from 
the images and then matched [18,28,27,10]. These methods work well when the 
surfaces (and curves) to be matched are reasonably smooth. Also, the surface 
fitting step that precedes matching is predicated on good feature extraction. 
These methods have not been widely accepted in domains such as brain matching 
due to the extreme variability of cortical surfaces. 

One of the principal reasons for the emphasis on object modeling in non-rigid
matching is that it allows us to circumvent the correspondence problem.
For example, once a smooth curve is fitted to a set of feature points, the
matching can be taken up at the curve level rather than at the point level.
Curve correspondence is easier than point correspondence [28] due to the strong
constraint imposed by the smooth curve on the space of possible point-to-point
correspondences. While the surface case is more complicated, surfaces can be
approximately matched when they are smooth and the allowed deformations are
not very complex [18,27]. The downside is the lack of robustness. Sensor noise
sometimes makes it difficult to fit smooth curves and surfaces to an underlying
set of feature points. In such cases, while point feature locations may still
be trustworthy, the fitting of surface normals and other higher-order features
becomes problematic. Consequently, these higher-order features cannot be used.

In this paper, we begin with an integrated pose and correspondence formula- 
tion using point features. Essentially, we modify the pose parameters to include 
non-rigid deformations. We now turn to a review of recent approaches that at- 
tempt to integrate the search for correspondence in non-rigid matching. While 
the correspondence problem has a long history in rigid, affine and projective 
point matching [14], there is a relative dearth of literature on non-rigid point
matching. Recently, there has been some interest in using point-based correspon- 
dence strategies in non-rigid matching [23,10,29,32,21]. The modal matching ap- 
proach in [23] relies on the point correspondence approach pioneered in [24] and 
further developed in [25]. The basic idea here is to use a pairing matrix that is 
built up from the Gaussian of the distances between any point feature in one set 
and the other. The modes of this matrix are used to obtain the correspondence. 
In [23], following [8], the deformation modes of the point-sets are obtained from
the principal components of the covariance matrix of a pre-specified training
set of shapes. The main drawback of this approach is that it does not use the
spatial relationships between the points in each set to constrain the search for
the correspondences and the mapping. In [9], after pointing out this drawback,
the inter-relationships between the points are taken into account by building a
graph representation from Delaunay triangulations. The search for correspon- 
dence is accomplished using inexact graph matching [26]. However, the spatial 
mappings are restricted to be affine or projective. In [1], decomposable graphs 
are hand-designed for deformable template matching and minimized with dy- 
namic programming. However, the graphs are not automatically generated and 
there is no direct relationship between the deformable model and the graphs that 
are used. In [17], a maximum clique approach [20,12] is used to match relational 
sulcal graphs. Again, the graphs are hand designed and not related to spatial 
deformations. 
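
The pairing-matrix idea mentioned in the review above can be sketched as
follows. The code builds a Gaussian proximity matrix between two point-sets,
replaces its singular values by ones, and accepts a pair when the resulting
entry dominates both its row and its column. This is a generic SVD-based
illustration written for this purpose; the function name and the value of
`sigma` are hypothetical, and no claim is made that it reproduces the schemes
of [24] or [25] in detail.

```python
import numpy as np


def svd_pairing(points_a, points_b, sigma=0.5):
    """Return {i: j} pairs from the modes (SVD) of a Gaussian proximity matrix."""
    points_a = np.asarray(points_a, dtype=float)
    points_b = np.asarray(points_b, dtype=float)
    diff = points_a[:, None, :] - points_b[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    proximity = np.exp(-sq_dist / (2.0 * sigma ** 2))
    u, _, vt = np.linalg.svd(proximity, full_matrices=False)
    pairing = u @ vt  # singular values replaced by ones
    matches = {}
    for i in range(pairing.shape[0]):
        j = int(np.argmax(pairing[i]))
        if int(np.argmax(pairing[:, j])) == i:  # dominant in row and column
            matches[i] = j
    return matches


if __name__ == "__main__":
    set_a = np.array([[0, 0], [1, 0], [0, 1], [2, 2], [3, 1]], dtype=float)
    order = [2, 0, 4, 1, 3]
    set_b = set_a[order] + 0.05  # shuffled copy with a small offset
    print(svd_pairing(set_a, set_b))
```

Note that the proximity matrix depends only on distances between the two sets,
which is the limitation pointed out above: the spatial relationships within
each set play no role in the pairing.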

3 Deriving the Distance Measure 

We first present background material on the thin-plate spline — our choice for 
the non-rigid spatial mapping. Then, we bring in the unknown correspondences 
and proceed with the derivation of the new distance measures. 

3.1 Thin-Plate Splines 

Our main reason for choosing the thin-plate spline is its well-understood
behavior in landmark matching [6]. Essentially, the thin-plate spline produces a
smoothly interpolated spatial mapping (with adherence to landmarks handled
by the data term). The thin-plate spline (TPS) formulation required here is
for 2D and 3D point matching. In both cases, we consider smoothness terms
comprising second-order derivatives of the interpolating function. Lack of
space does not permit us to present the thin-plate spline in great detail.
Instead, we merely present a “bare-bones” derivation. The interested reader is
referred to [31] for the general formulation and to [6] for the application to
landmark matching. In Figure 1, we depict an example thin-plate spline warping.
Note the decomposition into the affine and warping components, a special
characteristic of thin-plate spline mappings.



Fig. 1. Top Left: Original and warped point-sets. Top Right: Visualization of
the thin-plate mapping. Bottom Left: Affine component of the mapping. Bottom
Right: Warping component of the mapping.



Assume for the moment that we have $N$ pairs of corresponding points in
either 2D or in 3D. Denote the two point-sets by $Z$ and $X$ respectively. The
representation of the point-set $Z$ is shown for the cases of 2D and 3D below:

$$\text{in 2D } Z = \begin{bmatrix} 1 & z_1^1 & z_1^2 \\ 1 & z_2^1 & z_2^2 \\ \vdots & \vdots & \vdots \\ 1 & z_N^1 & z_N^2 \end{bmatrix}, \text{ and in 3D } Z = \begin{bmatrix} 1 & z_1^1 & z_1^2 & z_1^3 \\ 1 & z_2^1 & z_2^2 & z_2^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & z_N^1 & z_N^2 & z_N^3 \end{bmatrix} \tag{1}$$

A similar representation holds for the point-set $X$ as well. The
representations in (1) are the so-called homogeneous coordinates.

We now set up a thin-plate spline mapping from $X$ to $Z$. Thin-plate splines
are asymmetric in the sense that a mapping from $X$ to $Z$ cannot be easily
inverted to yield a mapping from $Z$ to $X$. Minimizing the following energy
function gives us a smooth spline interpolant capable of warping points in $X$
arbitrarily close to points in $Z$. A regularization parameter $\lambda$
determines the closeness of the fit. In 2D,



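$$E_{\mathrm{tps}}(f) = \sum_{i=1}^{N} \| z_i - f(x_i) \|^2 + \lambda \iint \left[ \left( \frac{\partial^2 f}{\partial (x^1)^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x^1 \partial x^2} \right)^2 + \left( \frac{\partial^2 f}{\partial (x^2)^2} \right)^2 \right] dx^1\, dx^2. \tag{2}$$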



A similar expression holds in 3D. 

Define $t = (x^1, x^2, \ldots, x^D)$ where $D = 2$ for 2D and $D = 3$ for 3D.
Then $t_i = (x_i^1, x_i^2, \ldots, x_i^D)$. Also, in 2D,
$[\phi_1(t), \phi_2(t), \phi_3(t)] = [1, x^1, x^2]$ with a straightforward
extension to 3D. For the thin-plate spline energy function given in (2), it is
possible to show that there exists a unique minimizer $f$ given by

$$f(t) = \sum_{k=1}^{D+1} d_k\, \phi_k(t) + \sum_{i=1}^{N} c_i\, E(t - t_i) \tag{3}$$

where $E(t - t_i)$ is the Green's function for the thin-plate spline:
$E(r) = |r|^2 \log |r|$ in 2D and $-|r|$ in 3D. Here
$|t - t_i| = \sqrt{\sum_{k=1}^{D} (x^k - x_i^k)^2}$. The minimizer $f$ of the
thin-plate spline energy function, given in (3), is specified in terms of two
unknowns $c$ and $d$. Using (3), it is possible to eliminate $f$ from the
thin-plate spline energy function. When this is done, we get

^tps 2 (c, d) = ||Z -Xd- Kef -h A trace c^Kc. (4) 

In (4), Z and X are the x (D + 1) point-sets, d is a (D 1) x (D + 1) affine 
transformation consisting of translation, rotation and shear components, c is a 
N X (D -1- 1) matrix of warping parameters (with all entries in the first column 
set to zero), and K is Oi N x N matrix corresponding to the Green’s function 
(which is different for 2D and 3D). The principal difference between Eft — ti) 
and K is that the latter is defined only at the landmark points: the matrix entry 
Kij = Efti - tj). 
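The homogeneous coordinates in (1) and the kernel matrix K can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `homogeneous` and `tps_kernel` are hypothetical, the landmark coordinates are random placeholders, and the 2D Green's function is assumed to be the standard thin-plate kernel $|r|^2 \log |r|$.

```python
import numpy as np

def homogeneous(points):
    """Prepend a column of ones: (N, D) -> (N, D + 1), as in (1)."""
    points = np.asarray(points, dtype=float)
    return np.hstack([np.ones((points.shape[0], 1)), points])

def tps_kernel(points):
    """Return the N x N matrix K with K[i, j] = E(t_i - t_j).

    Assumption: E(r) = |r|^2 log|r| in 2D and E(r) = -|r| in 3D.
    """
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    r = np.sqrt((diff ** 2).sum(axis=-1))
    if points.shape[1] == 2:
        safe_r = np.where(r > 0.0, r, 1.0)  # r = 0 maps to 1, giving E = 0
        return safe_r ** 2 * np.log(safe_r)
    return -r

rng = np.random.default_rng(0)
pts_2d = rng.random((6, 2))   # hypothetical landmark coordinates
X = homogeneous(pts_2d)       # shape (6, 3)
K = tps_kernel(pts_2d)        # shape (6, 6), symmetric with zero diagonal
```

Because K is evaluated only at the landmarks, it is a fixed matrix once the point-set is given; the function E itself is needed only when the fitted spline is evaluated at new locations.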

As it stands, finding least-squares solutions for the pair (c, d) by directly
minimizing (4) is awkward. Instead, a QR decomposition is used to separate the
affine and warping spaces. For more details, please see [31]:

$$
X = [Q_1, Q_2] \begin{pmatrix} R \\ 0 \end{pmatrix}, \tag{5}
$$

where $Q_1$ and $Q_2$ are N × (D + 1) and N × (N − D − 1) orthonormal matrices,
respectively, and R is upper triangular. With this transformation in place, (4)
becomes

$$
E_{\mathrm{tps}\otimes\mathrm{final}}(\gamma, d) = \| Q_2^T Z - Q_2^T K Q_2 \gamma \|^2
+ \| Q_1^T Z - R d - Q_1^T K Q_2 \gamma \|^2
+ \lambda \, \operatorname{trace}(\gamma^T Q_2^T K Q_2 \gamma), \tag{6}
$$

where $c = Q_2 \gamma$ and $\gamma$ is an (N − D − 1) × (D + 1) matrix. Given this definition of c,
$X^T c = 0$. The least-squares energy function in (6) can be first minimized w.r.t. $\gamma$
and then w.r.t. the affine transformation d. The final result after minimization is

$$
\gamma = (Q_2^T K Q_2 + \lambda I_{N-D-1})^{-1} Q_2^T Z
\quad \text{and} \quad
d = R^{-1} Q_1^T (Z - K Q_2 \gamma). \tag{7}
$$

The bending energy of the thin-plate spline after eliminating (c, d) is

$$
E_{\mathrm{bending}}(Z) = \operatorname{trace}\left[ Z^T Q_2 (Q_2^T K Q_2 + \lambda I_{N-D-1})^{-1} Q_2^T Z \right]. \tag{8}
$$
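The QR-based solution can be checked numerically. The following is a minimal sketch, assuming NumPy's complete QR factorization supplies $Q_1$, $Q_2$ and R, and assuming the 2D kernel $|r|^2 \log |r|$. The names `tps_fit` and `tps_energy` and the random landmark data are hypothetical.

```python
import numpy as np

def tps_fit(X, Z, K, lam):
    """Return (c, d) from (7) for homogeneous X, Z and kernel matrix K."""
    n, m = X.shape                                  # m = D + 1
    Q, R_full = np.linalg.qr(X, mode="complete")    # Q is n x n
    Q1, Q2 = Q[:, :m], Q[:, m:]
    R = R_full[:m, :]                               # upper triangular block
    A = Q2.T @ K @ Q2
    gamma = np.linalg.solve(A + lam * np.eye(n - m), Q2.T @ Z)
    c = Q2 @ gamma                                  # warping parameters
    d = np.linalg.solve(R, Q1.T @ (Z - K @ c))      # affine parameters
    return c, d

def tps_energy(X, Z, K, lam, c, d):
    """Energy (4): squared fit error plus lam * trace(c^T K c)."""
    resid = Z - X @ d - K @ c
    return float((resid ** 2).sum() + lam * np.trace(c.T @ K @ c))

# Hypothetical data: 8 random 2D landmarks and a noisy copy as the target.
rng = np.random.default_rng(1)
pts = rng.random((8, 2))
tgt = pts + 0.05 * rng.standard_normal((8, 2))
X = np.hstack([np.ones((8, 1)), pts])
Z = np.hstack([np.ones((8, 1)), tgt])
r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
safe_r = np.where(r > 0.0, r, 1.0)
K = safe_r ** 2 * np.log(safe_r)

c, d = tps_fit(X, Z, K, lam=0.1)
print(np.abs(X.T @ c).max())   # close to zero: c lies in the warping space
```

Writing $c = Q_2 \gamma$ enforces $X^T c = 0$ by construction, so no explicit constraint handling is needed. Since the first column of Z is all ones and therefore lies in the column space of X, the first column of c comes out as zero, as stated for (4).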



3.2 A Platonist Formulation 

Having described the thin-plate spline spatial mapping in its two conventional 
(integral and matrix kernel) forms, we turn to the integrated pose and corre- 
spondence formulation. 

First, we no longer assume that the correspondence between the point-sets 
Z and X is known. We introduce a correspondence matrix M which obeys the 
following constraints. 

1. The correspondences are binary: $M_{ai} \in \{0, 1\}$.

2. Every point in X is matched to one point in Z: $\sum_a M_{ai} = 1$.

3. Every point in Z is matched to one point in X or is an outlier w.r.t. X:
   $\sum_i M_{ai} \le 1$.

In informal terms, M is a matrix with binary entries whose columns sum to one 
and whose rows may either sum to one or be all zero. The one-to-one corre- 
spondence constraint is not sacrosanct and can be modified to a classification 
(many-to-one) constraint. This is because, in non-rigid mapping, a one-to-one 
constraint is incorrect when, for example, points are deformed into lines. 
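The three constraints are easy to test mechanically. The sketch below assumes M is stored as an array indexed `M[a, i]`, with a running over the points of Z (rows) and i over the points of X (columns); the helper name `is_valid_correspondence` and the two example matrices are hypothetical.

```python
import numpy as np

def is_valid_correspondence(M):
    """Check the three constraints for M indexed as M[a, i] (Z point a, X point i)."""
    M = np.asarray(M)
    binary = np.isin(M, (0, 1)).all()
    cols_ok = (M.sum(axis=0) == 1).all()   # every X point has exactly one match
    rows_ok = (M.sum(axis=1) <= 1).all()   # every Z point has at most one match
    return bool(binary and cols_ok and rows_ok)

M_good = np.array([[1, 0],
                   [0, 0],    # this Z point is an outlier w.r.t. X
                   [0, 1]])
M_bad = np.array([[1, 1],     # one Z point claimed by two X points
                  [0, 0],
                  [0, 0]])
print(is_valid_correspondence(M_good), is_valid_correspondence(M_bad))
```

Relaxing to the many-to-one (classification) constraint mentioned above would amount to dropping the row check while keeping the column sums equal to one.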

The bending energy of the thin-plate spline needs to be modified to take into 
account the introduction of the correspondence matrix M. Note that M allows 
us to generalize to the case of unequal point counts between X and Z. 

We now write a probabilistic generative model for obtaining Z given X and
a set of spline parameters (c, d). Since the correspondence M is unknown, it is
included in the generative model as a hidden variable:

$$
p(Z, M \mid X, c, d) = \frac{\exp[-E_1(Z, M, c, d)]}{Z_{\mathrm{part1}}}, \tag{9}
$$

where

$$
E_1(Z, M, c, d) = \sum_{ai} M_{ai} \| Z_a - (Xd)_i - (Kc)_i \|^2. \tag{10}
$$

In (10), $(Xd)_i$ and $(Kc)_i$ are the elements of the vectors Xd and Kc
respectively. The partition function $Z_{\mathrm{part1}}$ is a normalization constant. Equation (10)
is reminiscent of a Gaussian mixture model with Z playing the role of the
cluster centers and M being the complete data classification matrix [15]. If M is
presumed known, the least-squares term in (10) reduces to the thin-plate spline
least-squares term.

The pure bending energy term is exactly the same as in the thin-plate spline:

$$
p(c, d \mid X, \lambda) = \frac{\exp[-E_2(c)]}{Z_{\mathrm{part2}}},
\quad \text{where } E_2(c) = \lambda \, \operatorname{trace}(c^T K c). \tag{11}
$$

The two energy terms in (10) and (11) can be combined into one. The resulting
probabilistic generative model for Z can be written (after some algebraic
manipulation) as

$$
p(Z, M, c, d \mid X, \lambda) = \frac{\exp[-E(Z, M, c, d)]}{Z_{\mathrm{part}}}, \tag{12}
$$

where

$$
E(Z, M, c, d) = \| MZ - Xd - Kc \|^2 + \lambda \, \operatorname{trace}(c^T K c)
+ \operatorname{trace}\left( Z^T \Big[ \operatorname{diag}\Big( \sum_i M_{ai} \Big) - M^T M \Big] Z \right). \tag{13}
$$

The diag operator above takes a vector and rearranges it into a square matrix
with the vector entries appearing along the diagonal. The remaining entries are
zero. The binary nature of the entries of M makes the last term redundant since
it is zero. However, the term should be kept in mind if and when the binary
constraint on the entries of M is relaxed; in that event, the last term becomes
significant once again.
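The role of the last term in (13) can be illustrated numerically. The sketch below assumes M is stored as `M[a, i]` (Z point a, X point i), so the product written MZ in (13) corresponds to `M.T @ Z` in code, and it uses a hypothetical array `Y` as a stand-in for Xd + Kc. For binary M the correction term vanishes; for a soft M whose columns still sum to one, it accounts for the difference between the pairwise form (10) and the matrix form.

```python
import numpy as np

def pairwise_energy(M, Z, Y):
    """sum_{a,i} M[a, i] * ||Z[a] - Y[i]||^2, the least-squares form of (10)."""
    sq = ((Z[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # shape (n_z, n_x)
    return float((M * sq).sum())

def matrix_terms(M, Z, Y):
    """Return the matrix fit term and the trace correction term of (13)."""
    fit = float(((M.T @ Z - Y) ** 2).sum())
    extra = float(np.trace(Z.T @ (np.diag(M.sum(axis=1)) - M @ M.T) @ Z))
    return fit, extra

rng = np.random.default_rng(2)
Z = rng.random((4, 3))        # hypothetical super point-set (4 points)
Y = rng.random((3, 3))        # stand-in for Xd + Kc (3 points)

M_bin = np.array([[1, 0, 0],
                  [0, 0, 1],
                  [0, 0, 0],  # outlier row
                  [0, 1, 0]], dtype=float)
M_soft = rng.random((4, 3))
M_soft /= M_soft.sum(axis=0, keepdims=True)   # columns sum to one, entries in (0, 1)

fit_b, extra_b = matrix_terms(M_bin, Z, Y)
fit_s, extra_s = matrix_terms(M_soft, Z, Y)
print(extra_b, extra_s)       # zero for binary M, positive for the soft M
```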

A few key observations can be made regarding the integrated
correspondence-spline energy function in (13). After we define $Z_{\mathrm{perm}} = MZ$, note that the form
of the energy function is exactly the same as that of the original thin-plate spline
bending energy in (4). (The last term does not contain the warping parameters
(c, d) and from the perspective of solving for (c, d) the previous statement holds.)
Consequently, we can exploit all of the properties of the thin-plate spline that
were briefly derived in the previous section to separate out the warping and
affine spaces. We can eliminate (c, d) from (13) and this step is quite similar to
the work in [33]. The bending energy [after eliminating (c, d)] is

$$
E_{\mathrm{corr}\otimes\mathrm{bend}}(Z, M) = \operatorname{trace}\left[ Z^T M^T G M Z \right], \tag{14}
$$

where

$$
G = Q_2 (Q_2^T K Q_2 + \lambda I_{N-D-1})^{-1} Q_2^T. \tag{15}
$$

In deriving (14), we have dropped the last term in (13).

We are now in a position to extend this formulation to the simultaneous
non-rigid matching of several point-sets $X^1, \ldots, X^K$. We find that the traditional
Platonist metaphor suits us admirably. The point-set Z assumes the role of the
“light beyond the cave” and each point-set $X^k$ is cast in the role of a “shadow
perceived on the cave wall.” We model the Platonist super point-set Z as a
superset of all the points present in each of the real-world point-sets $X^k$,
$k \in \{1, \ldots, K\}$. We assume the following generative model for obtaining the
real-world point-sets from the archetype Z. Each real-world point-set $X^k$ is obtained
by (i) warping Z using a thin-plate spline, (ii) removing a subset of points from
Z, (iii) adding additive white Gaussian noise (AWGN) to the remaining points
and finally, (iv) erasing or forgetting the correspondence information between Z
and the newly created point-set $X^k$.

Since we have already worked out the bending energy expression [in (14)]
between Z and a single real-world point-set X, we now extend the formulation
to cover the simultaneous matching of Z to all of the real-world point-sets
$X^k$, $k \in \{1, \ldots, K\}$. Henceforth, we denote the set comprising all real-world
point-sets by $\mathbf{X}$. The sets of all correspondences, warping and affine parameters
are denoted by $\mathbf{M}$, $\mathbf{c}$ and $\mathbf{d}$ respectively.

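The likelihood model for the Platonist super point-set Z is

$$
p(Z, \mathbf{M}, \mathbf{c}, \mathbf{d} \mid \mathbf{X})
= \frac{\exp\left[ -\sum_{k=1}^{K} E(Z, M^k, c^k, d^k) \right]}{Z_{\mathrm{partall}}}
= \prod_{k=1}^{K} \frac{\exp\left[ -E(Z, M^k, c^k, d^k) \right]}{Z^k_{\mathrm{part}}}. \tag{16}
$$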



Fig. 2. Left: Platonist formulation. Right: Real-world reformulation, obtained
by eliminating (c, d) and integrating out Z.



An important (and somewhat remarkable) fact about (16) is its separability.
The Platonist super point-set Z is the sole bottleneck in the network of
connections between the real-world point-sets $\mathbf{X}$. Consequently, with Z fixed, we can
easily solve in closed form for the entire set of thin-plate spline warping
parameters $\mathbf{c}$ and $\mathbf{d}$. Note that the set of correspondence matrices $\mathbf{M}$ is also held fixed.
This calculation is merely a generalization of the earlier calculation involving
Z and X. Here, we have a set of point-sets $\mathbf{X}$ and Z. Our approach is
depicted schematically in Figure 2. On the left in Figure 2 is the original Platonist formulation
with the Platonist super point-set Z acting as a generator for the point-sets $X^k$.
On the right in Figure 2 is the real-world reformulation. With (c, d) eliminated
and Z integrated out, we obtain a distance measure between all of the real-world
point-sets. Note that the point-sets $X^k$ have been replaced by the corresponding
graphs $G^k$.



3.3 Eliminating the Spatial Mapping 

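The spline parameter set (c, d) is eliminated exactly as before in (14). The
only difference is that the elimination is carried out K times, once for each set
of parameters $(c^k, d^k)$, $k \in \{1, \ldots, K\}$. We will not repeat this derivation. The
bending energy after eliminating (c, d) is now a sum over all K bending energies:

$$
E_{\mathrm{corr}\otimes\mathrm{bend}\otimes\mathrm{total}}(Z, \mathbf{M}) = \sum_{k=1}^{K} \operatorname{trace}\left[ Z^T (M^k)^T G^k M^k Z \right], \tag{17}
$$

where $G^k$ is the matrix G of (15) computed for the point-set $X^k$.
With the above solution for the spatial mapping parameters (c, d), we may write
the likelihood for Z as

$$
p(Z, \mathbf{M}, \mathbf{c}, \mathbf{d} \mid \mathbf{X}) = \prod_{k=1}^{K} \frac{\exp\left[ -E_{\mathrm{corr}\otimes\mathrm{bend}}(Z, M^k) \right]}{Z^k_{\mathrm{part}}}. \tag{18}
$$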


3.4 Integrating out the Platonist Super Point-Set 

Before integrating out Z, we wish to point out the need for this step. In a 
standard Bayesian formulation [3] , integrating out the latent variables is recom- 
mended because the probabilistic structure is preserved by integration. 

The distance measure between the real-world point-sets $\mathbf{X}$ is defined as

$$
D(\mathbf{M}) = -\log \int p(Z, \mathbf{M} \mid \mathbf{X}) \, dZ. \tag{19}
$$

Note that the distance measure is a function of the unknown correspondences
between each real-world point-set $X^k$ and the Platonist super point-set Z.

In (19), we have used $p(Z, \mathbf{M} \mid \mathbf{X})$ as shorthand for $p(Z, \mathbf{M}, \mathbf{c}, \mathbf{d} \mid \mathbf{X})$. The
Platonist super point-set Z is now integrated out:

$$
D(\mathbf{M}) = -\log \int \exp\left[ -\operatorname{trace}\left( Z^T \Big( \sum_{k=1}^{K} (M^k)^T G^k M^k \Big) Z \right) \right] dZ
\;\propto\; \log\det\left[ \sum_{k=1}^{K} (M^k)^T G^k M^k \right] + \text{terms indep. of } \mathbf{M}. \tag{20}
$$

This is our non-rigid matching distance measure. It is a function of only the
set of correspondences $\mathbf{M}$. The thin-plate spline warping parameters have been
eliminated and the Platonist super point-set Z has been integrated out.

We now specialize to the case of non-rigid matching of two point-sets X and
Y. The distance measure between X and Y is

$$
D_{\mathrm{log}\otimes\mathrm{det}}(M^X, M^Y) = \log\det\left[ (M^X)^T G^X M^X + (M^Y)^T G^Y M^Y \right], \tag{21}
$$

where the “graph” $G^X$ is defined as

$$
G^X = Q_2^X \left[ (Q_2^X)^T K^X Q_2^X + \lambda I \right]^{-1} (Q_2^X)^T, \tag{22}
$$

with a similar expression holding for $G^Y$. The graph $G^X$ has the nice property
that it is symmetric and non-negative definite. It can easily be made positive
definite, which aids in the computation of (21).
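A minimal sketch of how the graph and the log-det distance might be computed for two 2D point-sets follows. Several choices here are assumptions rather than the authors' implementation: the function names, the 2D kernel $|r|^2 \log |r|$, the use of a small ridge term `ridge * I` as one simple way to make the matrix positive definite, and the convention that each correspondence matrix is stored with one row per real-world point and one column per point of Z, so that the products in (21) are conformable.

```python
import numpy as np

def spline_graph(pts, lam):
    """Graph G = Q2 (Q2^T K Q2 + lam I)^{-1} Q2^T for a 2D point-set."""
    n = pts.shape[0]
    X = np.hstack([np.ones((n, 1)), pts])          # homogeneous coordinates
    r = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    safe_r = np.where(r > 0.0, r, 1.0)
    K = safe_r ** 2 * np.log(safe_r)               # assumed 2D Green's function
    Q, _ = np.linalg.qr(X, mode="complete")
    Q2 = Q[:, X.shape[1]:]
    A = Q2.T @ K @ Q2
    return Q2 @ np.linalg.solve(A + lam * np.eye(A.shape[0]), Q2.T)

def logdet_distance(Mx, Gx, My, Gy, ridge=1e-6):
    """log det[(Mx^T Gx Mx + My^T Gy My) + ridge * I]; ridge is an assumption."""
    S = Mx.T @ Gx @ Mx + My.T @ Gy @ My
    sign, logdet = np.linalg.slogdet(S + ridge * np.eye(S.shape[0]))
    if sign <= 0:
        raise ValueError("matrix is not positive definite; increase ridge")
    return float(logdet)

rng = np.random.default_rng(3)
pts_x = rng.random((6, 2))                          # hypothetical point-set X
pts_y = pts_x + 0.02 * rng.standard_normal((6, 2))  # hypothetical point-set Y
Gx, Gy = spline_graph(pts_x, lam=1.0), spline_graph(pts_y, lam=1.0)

identity = np.eye(6)
shuffled = np.eye(6)[rng.permutation(6)]            # a random permutation matrix
d_true = logdet_distance(identity, Gx, identity, Gy)
d_perm = logdet_distance(identity, Gx, shuffled, Gy)
print(d_true, d_perm)
```

The ridge is needed because each graph annihilates the affine space of its own point-set, so the unregularized sum is singular: the constant vector, for example, lies in the null space of both terms.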



3.5 Comparison with Traditional Quadratic Assignment Distance 
Measures 

The new log-det distance measure in (21) can be directly compared with more
traditional quadratic assignment (QAP) distance measures. The QAP distance
measure is the obvious foil for comparison since it is the basic quadratic distance
measure that is popular and widely used. All the comparisons below are based on
the non-rigid matching of two point-sets X and Y. The QAP distance measure
is a quadratic distance between the two “graphs” $G^X$ and $G^Y$. Note that the
derivation of the “graphs” from thin-plate spline kernels is a new contribution,
one which is quasi-independent of the choice of distance between the two graphs.
The QAP distance measure is

$$
D_{\mathrm{qap}}(M^X, M^Y) = -\operatorname{trace}\left[ (M^X)^T G^X M^X (M^Y)^T G^Y M^Y \right]. \tag{23}
$$

Due to the cyclical property of the trace operator, the QAP distance can be
simplified as $-\operatorname{trace}(G^X M G^Y M^T)$, where $M = M^X (M^Y)^T$.
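The trace simplification can be verified numerically. The sketch below assumes the QAP distance has the standard form $-\operatorname{trace}[(M^X)^T G^X M^X (M^Y)^T G^Y M^Y]$ with the combined correspondence $M = M^X (M^Y)^T$. The random symmetric matrices stand in for the spline graphs, and all names are hypothetical.

```python
import numpy as np

def qap_distance(Mx, Gx, My, Gy):
    """Assumed QAP form: -trace(Mx^T Gx Mx My^T Gy My)."""
    return float(-np.trace(Mx.T @ Gx @ Mx @ My.T @ Gy @ My))

def qap_distance_simplified(Mx, Gx, My, Gy):
    """Same quantity after cycling the trace, with M = Mx My^T."""
    M = Mx @ My.T
    return float(-np.trace(Gx @ M @ Gy @ M.T))

rng = np.random.default_rng(4)
B, C = rng.random((5, 5)), rng.random((5, 5))
Gx, Gy = B @ B.T, C @ C.T                    # symmetric stand-ins for the graphs
Mx = np.eye(5)[rng.permutation(5)]           # permutation matrices
My = np.eye(5)[rng.permutation(5)]

full = qap_distance(Mx, Gx, My, Gy)
short = qap_distance_simplified(Mx, Gx, My, Gy)
print(full, short)                           # the two values agree
```

The simplified form depends on the two correspondence matrices only through their product, which is why a single matrix M suffices in the two point-set case.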



4 Results 

Figures 3 and 4 compare the QAP (quadratic) distance with the new log-det
distance. In Figure 3, there are no outliers going from the Platonist super point-set
to the two real-world point-sets shown at the top left of the figure. All three
distance measures perform well, with the log-det distance showing the greatest
separation. In Figure 4, the same comparison is repeated for two 15 point-sets
originating from a 20 point-set. Somewhat surprising is the degree to which the
log-det distance measure outperforms the quadratic distance. On the x-axis, we've
plotted permutations over a fixed number of points.

Fig. 3. Top Left: Two warped 2D 20 point-sets originating from a 20 point-set.
Top Right: Log-det distance measure. Bottom Left: Quadratic distance measure.
Bottom Right: Platonic distance measure. The distance measures (ordinate) are
plotted against permutations (abscissa). The abscissa value indicates how many
points were permuted to obtain the distances. When zero points are permuted,
the distance corresponding to the “true” answer is obtained [0 is the true
distance and 0 is a distance point for a given permutation] and is plotted at the
extreme left on the figure.

Fig. 4. Top Left: Two warped 2D 15 point-sets originating from a 20 point-set.
Top Right: Log-det distance measure. Bottom Left: Quadratic distance measure.
Bottom Right: Platonic distance measure. The distance measures (ordinate) are
plotted against permutations (abscissa). The abscissa value indicates how many
points were permuted to obtain the distances. When zero points are permuted,
the distance corresponding to the “true” answer is obtained [0 is the true
distance and 0 is a distance point for a given permutation] and is plotted at the
extreme left on the figure.

As more points are permuted, the distance measure ought to increase. We find
that this is the case for the log-det distance measure but not for the quadratic
distance. The Platonic distance is the quadratic distance between Z and the
real-world point-sets. Since Z contains all the information, this idealized distance
performs quite well. Note that there is obviously a question as to what the
“true” answer ought to be. However, the bending energy returned by the log-det
distance seems to concur with the Platonic bending energy, which is reassuring
since the latter is the closest you can get to a “gold standard.”




Fig. 5. The topology of the same weighted graph is shown using different thresh- 
olds. The node attributes and link weights are absent from this figure. As the 
threshold is increased (left to right), the topology becomes sparser as expected. 
The regularization parameter $\lambda$ is held fixed while the threshold (for displaying a
graph weight) is increased. The interplay between the regularization parameter 
and the graph topology needs to be further investigated. 

In Figure 5, we show a point-set with 20 points and associated weighted
graphs that have been derived from the spline kernel corresponding to the
point-set. We wanted to explore the topology of the graph to see if the spline kernel
seemed to use nearest neighbor heuristics in assigning weights. We thresholded
the graphs (with increasing thresholds from left to right) after taking the absolute
value of each element. The topology clearly has a nearest neighbor bias which is
more evident at larger thresholds. (The threshold for getting a certain number
of connections is proportional to the regularization parameter $\lambda$.)

In Figure 6, we took a point-set and obtained several point-sets from it by
progressively increasing the warping. We depict the topology (using the same
threshold for all graphs) for all the warped point-sets. It should be clear from
the figure that the topology gets increasingly distorted relative to the original
topology as you go from smaller to larger warps. However, a family resemblance
to the original parent is unmistakable.

5 Discussion and Conclusion 

The two main contributions of this paper are: i) a new definition of weighted
graphs based on the thin-plate spline kernel and ii) a new non-quadratic distance
measure that significantly outperforms the conventional quadratic assignment
distance measure. To a certain extent, these two contributions are independent
of one another. For instance, it should be possible to take our definition of
weighted graphs and use a different distance measure. From the weighted graph
standpoint, we have seen that the topology of the thin-plate spline kernel graphs
(after thresholding) is somewhat similar to graphs derived from Delaunay
triangulations with the important difference being that the spline-based graphs are
not planar. The similarity stems from the fact that local connections are favored
over more long range ones. We can also use different deformation mappings which
should lead to different weighted graph definitions. For example, if we used a
radial basis function (RBF) spline for the spatial mapping [33], the weighted
graph would have a RBF kernel at its core. From the standpoint of the distance
measure, we think that it is very significant that the new log-det distance
measure outperforms the quadratic assignment distance. For binary graphs, it has
already been shown that non-quadratic distance measures outperform QAP
distances [11] and that seems to apply here as well. Enthusiasm must be tempered,
however, until fast algorithms can be designed to find good, local minima of the
new log-det distance measure.

Fig. 6. On the top an original point-set is shown along with its graph depicted
for a certain threshold. Below, we show four different thin-plate warped
point-sets and their associated graphs. The warping increases as you go from left to
right. Note the increased distortion in the topology going from left to right. The
same threshold was used for all the graphs.

There are several ways to proceed on the algorithm front. First, it may be pos- 
sible to extend current deterministic annealing algorithms to the new distance. 
For instance, we could reduce the difficulty of algorithm design by choosing ap- 
propriate Legendre transformations [19] or by using Taylor series approximations 
of the log-det distance. Another approach would be to take the two topologies (af- 
ter suitable thresholding) and apply the new maximum clique-based algorithms 
developed in [22]. After matching the topologies, further refinement using the 
weights can be performed using the softassign weighted graph matching
algorithm [13].

In summary, we have shown that weighted graphs arise naturally in non-rigid
point matching problems. The graphs directly depend on the parameterization of 
the deformations. In addition, we have found that a principled Bayesian Platon- 
ist formulation of the problem naturally leads to a new non-quadratic distance 
measure that outperforms the traditional quadratic assignment distance mea- 
sure. It remains to be seen if effective algorithms can be designed that can take 
advantage of the better properties of the new distance measure.

Abstract. We study the dynamical properties of two new relaxation
labeling schemes described in terms of differential equations, and hence 
evolving in continuous time. This contrasts with the customary approach 
to defining relaxation labeling algorithms which prefers discrete time. 
Continuous-time dynamical systems are particularly attractive because 
they can be implemented directly in hardware circuitry, and the study 
of their dynamical properties is simpler and more elegant. They are also 
more plausible as models of biological visual computation. We prove that 
the proposed models enjoy exactly the same dynamical properties as the 
classical relaxation labeling schemes, and show how they are intimately 
related to Hummel and Zucker’s now classical theory of constraint satis- 
faction. In particular, we prove that, when a certain symmetry condition 
is met, the dynamical systems’ behavior is governed by a Liapunov func- 
tion which turns out to be (the negative of) a well-known consistency 
measure. Moreover, we prove that the fundamental dynamical properties 
of the systems are retained when the symmetry restriction is relaxed. We 
also analyze the properties of a simple discretization of the proposed dy- 
namics, which is useful in digital computer implementations. Simulation 
results are presented which show the practical behavior of the models. 



1 Introduction 

Relaxation labeling processes are a popular class of parallel, distributed compu- 
tational models aimed at solving (continuous) constraint satisfaction problems, 
instances of which arise in a wide variety of computer vision and pattern recog- 
nition tasks [1,9]. Almost invariably, all the relaxation algorithms developed 
so far evolve in discrete time, i.e., they are modeled as difference rather than 
as differential equations. The main reason for this widespread practice is that 
discrete-time dynamical systems are simpler to program and simulate on digital 
computers. However, continuous-time dynamical systems are more attractive for 
several reasons. First, they can more easily be implemented in parallel, analog 
circuitry (see, e.g., [4]). Second, the study of their dynamical properties is sim- 
plified thanks to the power of differential calculus, and proofs are more elegant 
and more easily understood. Finally, from a speculative standpoint, they are 
more plausible as models of biological computation [7]. 

Recently, there has been some interest in developing relaxation labeling 
schemes evolving in continuous time. In particular, we cite the work by
Stoddart [16] motivated by the Baum-Eagon inequality [12], and the recent work
by Li et al. [10] who developed a new relaxation scheme based on augmented
Lagrangian multipliers and Hopfield networks. Yu and Tsai [19] also used a 
continuous-time Hopfield network for solving labeling problems. All these stud- 
ies, however, are motivated by the assumption that the labeling problem is formu- 
lated as an energy-minimization problem, and a connection to standard theories 
of consistency [8] exists only when the compatibility coefficients are assumed to 
be symmetric. This is well-known to be a restrictive and unrealistic assumption. 
When the symmetry condition is relaxed the labeling problem is equivalent to 
a variational inequality problem, which is indeed a generalization of standard 
optimization problems [8]. 

In this paper, we study the dynamical properties of two simple relaxation 
labeling schemes which evolve in continuous time, each being described in terms 
of a system of coupled differential equations. The systems have been introduced 
in the context of evolutionary game theory, to model the evolution of relative 
frequencies of species in a multi-population setting [18], and one of them has 
also recently been proposed by Stoddart et al. [16], who studied its properties
only in the case of symmetric compatibilities. Both schemes are considerably
simpler than Hummel and Zucker’s continuous-time model [8], which requires
a complicated projection operator. Moreover, the first scheme has no
normalization phase, and this makes it particularly attractive for practical hardware
implementations. Since our models automatically satisfy the constraints imposed
by the structure of the labeling problem, they are also much simpler than Yu
and Tsai’s [19] and Li et al.’s [10] schemes, which have to take constraints into
account either in the form of penalty functions or Lagrange multipliers. 

The principal objective of this study is to analyze the dynamics of these relax- 
ation schemes and to relate them to the classical theory of consistency developed 
by Hummel and Zucker [8]. We show that all the dynamical properties enjoyed 
by standard relaxation labeling algorithms do hold for ours. In particular, we 
prove that, when symmetric compatibility coefficients are employed, the models 
have a Liapunov function which rules their dynamical behavior, and this turns 
out to be (the negative of) a well-known consistency measure. Moreover, and 
most importantly, we prove that the fundamental dynamical properties of the 
systems are retained when the symmetry restriction is relaxed. We also study the 
properties of a simple discretization of the proposed models, which is useful in 
digital computer implementations. Some simulation results are presented which 
show how the models behave in practice and confirm their validity. 

The outline of the paper is as follows. In Section 2, we briefly review Hummel 
and Zucker’s consistency theory, which is instrumental for the subsequent devel- 
opment. In Section 3 we introduce the models and in Section 4 we present the 
main theoretical results, first for the symmetric and then for the non-symmetric 
case. Section 5 describes two ways of discretizing the models, and proves some 
results. In Section 6 we present our simulation results, and Section 7 concludes 
the paper.

2 Consistency and Its Properties



The labeling problem involves a set of objects $B = \{b_1, \ldots, b_n\}$ and a set of
possible labels $\Lambda = \{1, \ldots, m\}$. The purpose is to label each object of B with
one label of $\Lambda$. To accomplish this, two sources of information are exploited. The
first one relies on local measurements which capture the salient features of each
object viewed in isolation; classical pattern recognition techniques can be
practically employed to carry out this task. The second source of information, instead,
accounts for possible interactions among nearby labels and, in fact, incorporates
all the contextual knowledge about the problem at hand. This is quantitatively
expressed by means of a real-valued four-dimensional matrix of compatibility
coefficients $R = \{r_{ij}(\lambda, \mu)\}$. The coefficient $r_{ij}(\lambda, \mu)$ measures the strength of
compatibility between the hypotheses “$b_i$ has label $\lambda$” and “$b_j$ has label $\mu$”:
high values correspond to compatibility and low values correspond to
incompatibility. In our discussion, the compatibilities are assumed to be nonnegative, i.e.,
$r_{ij}(\lambda, \mu) \ge 0$, but this seems not to be a severe limitation because all the
interesting concepts involved here exhibit a sort of “linear invariance” property [12]. In
this paper, moreover, we will not be concerned with the crucial problem of how
to derive the compatibility coefficients. Suffice it to say that they can be either
determined on the basis of statistical grounds [11,15] or, according to a more
recent standpoint, adaptively learned over a sample of training data [14,13].

The initial local measurements are assumed to provide, for each object $b_i \in B$,
an m-dimensional vector $p_i^{(0)} = (p_i^{(0)}(1), \ldots, p_i^{(0)}(m))^T$ (where “T” denotes the usual
transpose operation), such that $p_i^{(0)}(\lambda) \ge 0$, $i = 1 \ldots n$, $\lambda \in \Lambda$, and $\sum_\lambda p_i^{(0)}(\lambda) = 1$,
$i = 1 \ldots n$. Each $p_i^{(0)}(\lambda)$ can be regarded as the initial, non-contextual degree of
confidence of the hypothesis “$b_i$ is labeled with label $\lambda$.” By simply concatenating
$p_1^{(0)}, p_2^{(0)}, \ldots, p_n^{(0)}$ we obtain a weighted labeling assignment for the objects of B that
will be denoted by $p^{(0)} \in \mathbb{R}^{nm}$. A relaxation labeling process takes as input the
initial labeling assignment and iteratively updates it taking into account the
compatibility model R.

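At this point, we introduce the space of weighted labeling assignments:

$$
\mathbb{K} = \left\{ p \in \mathbb{R}^{nm} \;\middle|\; p_i(\lambda) \ge 0,\ i = 1 \ldots n,\ \lambda \in \Lambda \ \text{ and } \ \sum_{\lambda=1}^{m} p_i(\lambda) = 1,\ i = 1 \ldots n \right\},
$$

which is a linear convex set of $\mathbb{R}^{nm}$. Every vertex of $\mathbb{K}$ represents an unambiguous
labeling assignment, that is one which assigns exactly one label to each object.
The set of these labelings will be denoted by $\mathbb{K}^*$:

$$
\mathbb{K}^* = \left\{ p \in \mathbb{K} \mid p_i(\lambda) = 0 \text{ or } 1,\ i = 1 \ldots n,\ \lambda \in \Lambda \right\}.
$$

Moreover, a labeling p in the interior of $\mathbb{K}$ (i.e., $0 < p_i(\lambda) < 1$, for all i and $\lambda$)
will be called strictly ambiguous.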

Now, let $p \in \mathbb{K}$ be any labeling assignment. To develop a relaxation algorithm
that updates p in accordance with the compatibility model, we need to define,
for each object $b_i \in B$ and each label $\lambda \in \Lambda$, what is called a support function.
This should quantify the degree of agreement between the hypothesis that $b_i$ is
labeled with $\lambda$, whose confidence is expressed by $p_i(\lambda)$, and the context. This
measure is commonly defined as follows:

$$
q_i(\lambda; p) = \sum_{j=1}^{n} \sum_{\mu=1}^{m} r_{ij}(\lambda, \mu) \, p_j(\mu). \tag{1}
$$

Putting together the instances $q_i(\lambda; p)$, for all the $p_i(\lambda)$, we obtain an
nm-dimensional support vector that will be denoted by $q(p)$.
The following updating rule

$$
p_i^{(t+1)}(\lambda) = \frac{p_i^{(t)}(\lambda) \, q_i^{(t)}(\lambda)}{\sum_{\mu} p_i^{(t)}(\mu) \, q_i^{(t)}(\mu)}, \tag{2}
$$

where t = 0, 1, ... denotes (discrete) time, defines the original relaxation labeling
operator of Rosenfeld, Hummel, and Zucker [15], whose dynamical properties
have recently been clarified [12]. In the following discussion we shall refer to it
as the “classical” relaxation scheme.
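The support function (1) and the classical update (2) translate directly into array operations. In this sketch the compatibilities are assumed to be stored as a four-dimensional array `R[i, j, lam, mu]` and a labeling as an n × m array `p[i, lam]`; the function names and the random test data are hypothetical.

```python
import numpy as np

def support(R, p):
    """q[i, lam] = sum over j, mu of R[i, j, lam, mu] * p[j, mu], as in (1)."""
    return np.einsum("ijlm,jm->il", R, p)

def classical_step(R, p):
    """One iteration of the update rule (2); each row of p stays on the simplex."""
    weighted = p * support(R, p)
    return weighted / weighted.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, m = 4, 3                                   # 4 objects, 3 labels
R = rng.random((n, n, m, m)) + 0.1            # strictly positive compatibilities
p = rng.random((n, m))
p /= p.sum(axis=1, keepdims=True)             # initial weighted labeling

for _ in range(20):
    p = classical_step(R, p)
print(p.sum(axis=1))                          # every row still sums to one
```

Strictly positive compatibilities are used here so that the denominator in (2) can never vanish; with compatibilities that are merely nonnegative, a guard against division by zero would be needed.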

We now briefly review Hummel and Zucker’s theory of constraint
satisfaction [8], which commences by providing a general definition of consistency. By
analogy with the unambiguous case, which is more easily understood, a weighted
labeling assignment $p \in \mathbb{K}$ is said to be consistent if

$$
\sum_{\lambda=1}^{m} p_i(\lambda) \, q_i(\lambda; p) \ge \sum_{\lambda=1}^{m} v_i(\lambda) \, q_i(\lambda; p), \quad i = 1 \ldots n \tag{3}
$$

for all $v \in \mathbb{K}$. Furthermore, if strict inequalities hold in (3), for all $v \ne p$, then p
is said to be strictly consistent. It can be seen that a necessary condition for p
to be strictly consistent is that it is an unambiguous one, that is $p \in \mathbb{K}^*$.

In [8], Hummel and Zucker introduced the average local consistency,¹ defined
as

$$
A(p) = \sum_{i=1}^{n} \sum_{\lambda=1}^{m} p_i(\lambda) \, q_i(\lambda) \tag{4}
$$

and proved that when the compatibility matrix R is symmetric, i.e., $r_{ij}(\lambda, \mu) =
r_{ji}(\mu, \lambda)$ for all $i, j, \lambda, \mu$, then any local maximum $p \in \mathbb{K}$ of A is consistent.
Basically, this follows immediately from the fact that, when R is symmetric, we
have $\nabla A(p) = 2q$, $\nabla A(p)$ being the gradient of A at p. Note that, in general,
the converse need not be true since, to prove this, second-order derivative
information would be required. However, by demanding that p be strictly consistent,
this does happen [12].
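The gradient identity for symmetric R can be checked by finite differences. The sketch assumes the array layout `R[i, j, lam, mu]` and `p[i, lam]`; the helper names are hypothetical. Since A is a quadratic form in p, a central difference reproduces its gradient up to rounding error.

```python
import numpy as np

def support(R, p):
    """Linear support q[i, lam] of equation (1)."""
    return np.einsum("ijlm,jm->il", R, p)

def average_local_consistency(R, p):
    """A(p) = sum over i, lam of p[i, lam] * q[i, lam], as in (4)."""
    return float((p * support(R, p)).sum())

def numerical_gradient(R, p, h=1e-5):
    """Central-difference gradient of A with respect to every entry of p."""
    grad = np.zeros_like(p)
    for idx in np.ndindex(*p.shape):
        step = np.zeros_like(p)
        step[idx] = h
        grad[idx] = (average_local_consistency(R, p + step)
                     - average_local_consistency(R, p - step)) / (2.0 * h)
    return grad

rng = np.random.default_rng(1)
n, m = 3, 4
R = rng.random((n, n, m, m))
R_sym = 0.5 * (R + R.transpose(1, 0, 3, 2))   # enforces r_ij(lam, mu) = r_ji(mu, lam)
p = rng.random((n, m))
p /= p.sum(axis=1, keepdims=True)

grad_sym = numerical_gradient(R_sym, p)
print(np.abs(grad_sym - 2.0 * support(R_sym, p)).max())   # close to zero
```

For a non-symmetric R the gradient is instead the support computed from the symmetrized compatibilities, scaled by two, so the identity with q generally fails.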

¹ Henceforth, when it is clear from context, the dependence on p will not be
stated.

3 Continuous-Time Relaxation Labeling Processes



The two relaxation labeling models studied in this paper are defined by the
following systems of coupled differential equations:

$$
\frac{d}{dt} p_i(\lambda) = p_i(\lambda) \left[ q_i(\lambda) - \sum_{\mu} p_i(\mu) \, q_i(\mu) \right] \tag{5}
$$

and

$$
\frac{d}{dt} p_i(\lambda) = p_i(\lambda) \, \frac{q_i(\lambda) - \sum_{\mu} p_i(\mu) \, q_i(\mu)}{\sum_{\mu} p_i(\mu) \, q_i(\mu)}. \tag{6}
$$

For the purpose of the present discussion, $q_i(\lambda)$ denotes the linear support as
defined in equation (1). As a matter of fact, many of the results proved below do
not depend on this particular choice. More generally, the only requirements are
that the support function be nonnegative and, to be able to grant the existence
and uniqueness of the solution of the differential equations, that it be of class
$C^1$ [6].

In the first model we note that, although there is no explicit normalization 
process in the updating rule, the assignment space IK is invariant under dynam- 
ics (5). This means that any trajectory starting in IK will remain in IK. To see 
this, simply note that: 



which means that the interior of IK is invariant. The additional observation that 
the boundary too is invariant completes the proof. The same result can be proven 
for the other model as well, following basically the same steps. The lack of 
normalization makes the first model, which we call the standard model, more 
attractive than Hummel and Zucker’s projection-based scheme [8], since it makes 
it more amenable to hardware implementations and more acceptable biologically. 
The interest in the other model, called the normalized model and also studied 
by Stoddart [16], derives from the fact that, in a way, it is the continuous-time 
translation of the classical Rosenfeld-Hummel-Zucker relaxation scheme [15]. 
This will be clearer when we show the discretizations of the models. We note 
that, using a linear support function (1), the dynamics of the models is invariant
under a rescaling of the compatibility coefficients $r_{ij}(\lambda, \mu)$. That is, if we define
a set of new compatibility coefficients $r'_{ij}(\lambda, \mu) = \alpha r_{ij}(\lambda, \mu) + \beta$, with $\alpha > 0$ and
$\beta > 0$, the orbit followed by the model remains the same, while the speed at
which the dynamics evolve changes by a factor $\alpha$.
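
As an illustration, the two vector fields can be written in a few lines of Python. This is a minimal sketch, not the authors' implementation: the nested-list layout `r[i][j][lam][mu]` for $r_{ij}(\lambda, \mu)$, the function names, and the random test data are assumptions made for the example. It checks two statements from the text numerically for the linear support: the derivatives of each object sum to zero on K, and a rescaling $\alpha r + \beta$ multiplies the standard field by $\alpha$.

```python
import random

def support(r, p):
    """Linear support: q_i(lam) = sum_j sum_mu r[i][j][lam][mu] * p[j][mu]."""
    n, m = len(p), len(p[0])
    return [[sum(r[i][j][lam][mu] * p[j][mu]
                 for j in range(n) for mu in range(m))
             for lam in range(m)] for i in range(n)]

def f_standard(r, p):
    """Right-hand side of (5): p_i(lam) * (q_i(lam) - sum_mu p_i(mu) q_i(mu))."""
    out = []
    for pi, qi in zip(p, support(r, p)):
        avg = sum(a * b for a, b in zip(pi, qi))
        out.append([a * (b - avg) for a, b in zip(pi, qi)])
    return out

def f_normalized(r, p):
    """Right-hand side of (6): the field of (5) divided by sum_mu p_i(mu) q_i(mu)."""
    out = []
    for pi, qi in zip(p, support(r, p)):
        avg = sum(a * b for a, b in zip(pi, qi))
        out.append([a * (b - avg) / avg for a, b in zip(pi, qi)])
    return out

def random_problem(n, m, seed=0):
    """Hypothetical test data: positive compatibilities and a labeling in K."""
    rng = random.Random(seed)
    r = [[[[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(m)]
          for _ in range(n)] for _ in range(n)]
    p = []
    for _ in range(n):
        row = [rng.uniform(0.1, 1.0) for _ in range(m)]
        total = sum(row)
        p.append([v / total for v in row])
    return r, p

r, p = random_problem(n=4, m=3)
# Each object's derivatives sum to zero, so the row sums of p are preserved.
drift = max(abs(sum(row)) for row in f_standard(r, p) + f_normalized(r, p))
# Rescaling r -> alpha * r + beta multiplies the standard field by alpha.
alpha, beta = 2.0, 0.5
r2 = [[[[alpha * v + beta for v in row] for row in blk] for blk in ri] for ri in r]
ratio_error = max(abs(a - alpha * b)
                  for ra, rb in zip(f_standard(r2, p), f_standard(r, p))
                  for a, b in zip(ra, rb))
```

Both `drift` and `ratio_error` should be zero up to floating-point rounding.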

As stated in the Introduction, one attractive feature of continuous-time systems
is that they are readily mapped onto hardware circuitry. In [17] we show
circuit implementations for both the standard and the normalized models.
As expected, the standard model leads to a more economical implementation.

The fixed (or equilibrium) points of our dynamical systems are characterized
by $\frac{d}{dt} p_i(\lambda) = 0$ or, more explicitly, by $p_i(\lambda) \left[ q_i(\lambda) - \sum_{\mu} p_i(\mu) q_i(\mu) \right] = 0$ for all
$i = 1 \ldots n$, $\lambda \in \Lambda$. This leads us to the condition

\[ p_i(\lambda) > 0 \;\Rightarrow\; q_i(\lambda) = \sum_{\mu} p_i(\mu) q_i(\mu) \tag{7} \]

which is the same condition we have for the Rosenfeld-Hummel-Zucker and
Hummel-Zucker models.

The next result follows immediately from a characterization of consistent 
labelings proved in [12, Theorem 3.1]. 



Proposition 1. Let $p \in K$ be consistent. Then p is an equilibrium point for
the relaxation dynamics (5) and (6). Moreover, if p is strictly ambiguous the
converse also holds.



This establishes a first connection between our continuous-time relaxation 
labeling processes and Hummel and Zucker’s theory of consistency. 



4 The Dynamical Properties of the Models 

In this section we study the dynamical properties of the proposed dynamical 
systems. Specifically, we show how our continuous-time relaxation schemes are 
intimately related to Hummel and Zucker’s theory of consistency, and enjoy all 
the dynamical properties which hold for the classical discrete-time scheme (2), 
and Hummel and Zucker’s projection-based model. 

Before going into the technical details, we briefly review some instrumental
concepts in dynamical systems theory; see [6] for details. Given a dynamical system,
an equilibrium point x is said to be stable if, whenever started sufficiently
close to x, the system will remain near to x for all future times. A stronger
property, which is even more desirable, is that the equilibrium point x be asymptotically
stable, meaning that x is stable and in addition is a local attractor, i.e.,
when initiated close to x, the system tends towards x as time increases. One of
the most fundamental tools for establishing the stability of a given equilibrium
point is known as Liapunov’s direct method. It involves seeking a so-called
Liapunov function, i.e., a continuous real-valued function defined in state space
which is non-increasing along a trajectory.



4.1 Symmetric Compatibilities 

We present here some results which hold when the compatibility matrix R is symmetric,
i.e., $r_{ij}(\lambda, \mu) = r_{ji}(\mu, \lambda)$, for all $i, j = 1 \ldots n$ and $\lambda, \mu \in \Lambda$. The following
instrumental lemma, however, holds for the more general case of asymmetric
matrices.

Lemma 1. For all $p \in K$ we have

\[ q(p) \cdot \frac{d}{dt} p \ge 0 \]

where $\cdot$ represents the inner product operator, for both the standard and
normalized relaxation schemes (5) and (6).

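Proof: Let p be an arbitrary labeling assignment in K. For the standard model
we have:

\[ \begin{aligned}
q(p) \cdot \frac{d}{dt} p &= \sum_{i,\lambda} q_i(\lambda) p_i(\lambda) \left( q_i(\lambda) - \sum_{\mu} p_i(\mu) q_i(\mu) \right) \\
&= \sum_{i} \left[ \sum_{\lambda} p_i(\lambda) q_i^2(\lambda) - \left( \sum_{\lambda} p_i(\lambda) q_i(\lambda) \right)^2 \right]
\end{aligned} \]

Using the Cauchy-Schwarz inequality we obtain, for all $i = 1 \ldots n$,

\[ \left( \sum_{\lambda} p_i(\lambda) q_i(\lambda) \right)^2 = \left( \sum_{\lambda} \sqrt{p_i(\lambda)} \cdot \sqrt{p_i(\lambda)}\, q_i(\lambda) \right)^2 \le \sum_{\lambda} p_i(\lambda) \cdot \sum_{\lambda} p_i(\lambda) q_i^2(\lambda) = \sum_{\lambda} p_i(\lambda) q_i^2(\lambda) \]

Hence, since $\sum_{\lambda} p_i(\lambda) q_i^2(\lambda) \ge \left( \sum_{\lambda} p_i(\lambda) q_i(\lambda) \right)^2$, we have $q(p) \cdot \frac{d}{dt} p \ge 0$.

The proof for the normalized model is identical; we just observe that:

\[ q(p) \cdot \frac{d}{dt} p = \sum_{i} \frac{\sum_{\lambda} p_i(\lambda) q_i^2(\lambda) - \left( \sum_{\lambda} p_i(\lambda) q_i(\lambda) \right)^2}{\sum_{\mu} p_i(\mu) q_i(\mu)} \]

□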



A straightforward consequence of the previous lemma is the following impor- 
tant result, which states that, in the symmetric case, the average local consis- 
tency is always non-decreasing along the trajectories of our dynamical systems. 

Theorem 1. If the compatibility matrix R is symmetric, we have

\[ \frac{d}{dt} A(p) \ge 0 \]

for all $p \in K$. In other words, $-A$ is a Liapunov function for the relaxation
models (5) and (6).

Proof: Assuming $r_{ij}(\lambda, \mu) = r_{ji}(\mu, \lambda)$, we have:

\[ \frac{d}{dt} A(p) = 2 \sum_{i\lambda} \sum_{j\mu} r_{ij}(\lambda, \mu)\, p_j(\mu)\, \frac{d}{dt} p_i(\lambda) = 2\, q(p) \cdot \frac{d}{dt} p \ge 0 \]

□



As far as the normalized scheme is concerned, this result has been proven 
by Stoddart [16]. By combining the previous result with the fact that strictly 
consistent labelings are local maxima of the average local consistency (see [12,
Proposition 3.4]) we readily obtain the following theorem.

Theorem 2. Let p be a strictly consistent labeling and suppose that the compatibility
matrix R is symmetric. Then p is an asymptotically stable stationary
point for the relaxation labeling processes (5) and (6) and, consequently, is a
local attractor.

Therefore, in the symmetric case our continuous-time processes have ex- 
actly the same dynamical properties as the classical Rosenfeld-Hummel-Zucker 
model [12] and the Hummel-Zucker projection-based scheme [8]. 
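
Lemma 1 and Theorem 1 can be spot-checked numerically. The sketch below is illustrative only: it assumes the linear support, takes the average local consistency to be $A(p) = \sum_i \sum_{\lambda} p_i(\lambda) q_i(\lambda)$, stores $r_{ij}(\lambda, \mu)$ as `r[i][j][lam][mu]`, and uses random data and helper names that are not from the paper. It evaluates $q(p) \cdot \frac{d}{dt} p$ for an asymmetric matrix and, after symmetrizing, compares the change in A over one small step along the standard field with the first-order prediction $2h\, q(p) \cdot F(p)$.

```python
import random

def support(r, p):
    """Linear support q_i(a) = sum_j sum_b r[i][j][a][b] * p[j][b]."""
    n, m = len(p), len(p[0])
    return [[sum(r[i][j][a][b] * p[j][b] for j in range(n) for b in range(m))
             for a in range(m)] for i in range(n)]

def f_standard(r, p):
    """Right-hand side of the standard model (5)."""
    q = support(r, p)
    return [[pi[a] * (qi[a] - sum(x * y for x, y in zip(pi, qi)))
             for a in range(len(pi))] for pi, qi in zip(p, q)]

def inner(u, v):
    """Inner product of two assignments stored as lists of rows."""
    return sum(x * y for ru, rv in zip(u, v) for x, y in zip(ru, rv))

def avg_consistency(r, p):
    """Assumed definition: A(p) = sum_i sum_lam p_i(lam) q_i(lam)."""
    return inner(p, support(r, p))

def symmetrize(r):
    """Return r'_ij(a, b) = (r_ij(a, b) + r_ji(b, a)) / 2, which is symmetric."""
    n, m = len(r), len(r[0][0])
    return [[[[0.5 * (r[i][j][a][b] + r[j][i][b][a]) for b in range(m)]
              for a in range(m)] for j in range(n)] for i in range(n)]

rng = random.Random(1)
n, m = 4, 3
r = [[[[rng.random() for _ in range(m)] for _ in range(m)]
      for _ in range(n)] for _ in range(n)]
p = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(n)]
p = [[v / sum(row) for v in row] for row in p]

# Lemma 1 holds for the asymmetric matrix r as well.
lemma_value = inner(support(r, p), f_standard(r, p))

# Theorem 1 needs symmetry: A grows along a small step of the standard field.
rs = symmetrize(r)
h = 1e-3
f = f_standard(rs, p)
p_next = [[a + h * b for a, b in zip(pr, fr)] for pr, fr in zip(p, f)]
gain = avg_consistency(rs, p_next) - avg_consistency(rs, p)
```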



4.2 Arbitrary Compatibilities 

In the preceding subsection we have restricted ourselves to the case of symmet- 
ric compatibility coefficients and have shown how, under this circumstance, the 
proposed continuous-time relaxation schemes are closely related to the theory of 
consistency of Hummel and Zucker. However, although symmetric compatibilities
can easily be derived and asymmetric matrices can always be made symmetrical
(i.e., by considering $R + R^T$), it would be desirable for a relaxation process
to work also when no restriction on the compatibility matrix is imposed [8]. This
is especially true when the relaxation algorithm is viewed as a plausible model
of how biological systems perform visual computation [20].

We now show that the proposed relaxation dynamical systems still perform 
useful computations in this case, and their connection with the theory of consis- 
tency continues to hold. The main result is the following: 

Theorem 3. Let $p \in K$ be a strictly consistent labeling. Then p is an asymptotically
stable equilibrium point for the continuous-time relaxation labeling schemes
defined in equations (5) and (6).

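Proof: The first step in proving the theorem is to rewrite the models in the
following way:

\[ \frac{d}{dt} p = F(p) \]

where, for all $i = 1 \ldots n$ and $\lambda \in \Lambda$,

\[ F_i(\lambda)(p) = p_i(\lambda) \left[ q_i(\lambda) - \sum_{\mu} p_i(\mu) q_i(\mu) \right] \]

for the standard model, and

\[ F_i(\lambda)(p) = p_i(\lambda)\, \frac{q_i(\lambda) - \sum_{\mu} p_i(\mu) q_i(\mu)}{\sum_{\mu} p_i(\mu) q_i(\mu)} \]

for the normalized model.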

Let $DF(p)$ be the differential of F in p. We will show that if p is strictly
consistent all eigenvalues of $DF(p)$ are real and negative. This means that p is a
sink for the dynamical system and therefore an asymptotically stable point [6].

We begin by recalling that a strictly consistent labeling is necessarily non-ambiguous.
Denoting by $\lambda(i)$ the unique label assigned to object $b_i$, we have:

\[ p_i(\lambda) = \delta_{\lambda\lambda(i)} \]

where $\delta$ is the Kronecker delta, i.e., $\delta_{xy} = 1$ if $x = y$, and $\delta_{xy} = 0$ otherwise.
Furthermore, we have $q_i(\lambda(i)) > q_i(\lambda)$ for all $\lambda \ne \lambda(i)$.

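We first prove the theorem for the standard model. Differentiating F with respect
to $p_j(\mu)$, we have:

\[ \frac{\partial F_i(\lambda)}{\partial p_j(\mu)}(p) = \delta_{ij} \delta_{\lambda\mu} \left( q_i(\lambda) - \sum_{\rho} p_i(\rho) q_i(\rho) \right) + p_i(\lambda) \left( \frac{\partial q_i(\lambda)}{\partial p_j(\mu)} - \delta_{ij}\, q_i(\mu) - \sum_{\rho} p_i(\rho) \frac{\partial q_i(\rho)}{\partial p_j(\mu)} \right) \]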





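If we arrange the assignment vector in the following way:

\[ p = (p_1(\lambda_1), \ldots, p_1(\lambda_m), \ldots, p_n(\lambda_1), \ldots, p_n(\lambda_m)) \]

and define the matrices $C_{ij} = (C_{ij}(\lambda, \mu))_{\lambda\mu}$ as $C_{ij}(\lambda, \mu) = \frac{\partial F_i(\lambda)}{\partial p_j(\mu)}(p)$, the differential
takes the form:

\[ DF = \begin{pmatrix} C_{11} & \cdots & C_{1n} \\ \vdots & \ddots & \vdots \\ C_{n1} & \cdots & C_{nn} \end{pmatrix} \]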
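
We can show that, if p is strictly consistent, $C_{ij} = 0$ if $i \ne j$. In fact, we
have:

\[ \frac{\partial F_i(\lambda)}{\partial p_j(\mu)}(p) = p_i(\lambda) \left( \frac{\partial q_i(\lambda)}{\partial p_j(\mu)} - \sum_{\rho} p_i(\rho) \frac{\partial q_i(\rho)}{\partial p_j(\mu)} \right) = \delta_{\lambda\lambda(i)} \left( \frac{\partial q_i(\lambda)}{\partial p_j(\mu)} - \frac{\partial q_i(\lambda(i))}{\partial p_j(\mu)} \right) = 0 \]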

In this case the differential takes the form:

\[ DF = \begin{pmatrix} C_{11} & & 0 \\ & \ddots & \\ 0 & & C_{nn} \end{pmatrix} \]

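Analyzing the matrices $C_{ii}$ we can see that these too take a particular form
on strictly consistent assignments. In fact we have:

\[ \begin{aligned}
\frac{\partial F_i(\lambda)}{\partial p_i(\mu)}(p) &= \delta_{\lambda\mu} \left( q_i(\lambda) - \sum_{\rho} p_i(\rho) q_i(\rho) \right) + p_i(\lambda) \left( \frac{\partial q_i(\lambda)}{\partial p_i(\mu)} - q_i(\mu) - \sum_{\rho} p_i(\rho) \frac{\partial q_i(\rho)}{\partial p_i(\mu)} \right) \\
&= \delta_{\lambda\mu} \left( q_i(\lambda) - q_i(\lambda(i)) \right) + \delta_{\lambda\lambda(i)} \left( \frac{\partial q_i(\lambda)}{\partial p_i(\mu)} - q_i(\mu) - \frac{\partial q_i(\lambda(i))}{\partial p_i(\mu)} \right) \\
&= \delta_{\lambda\mu} \left( q_i(\lambda) - q_i(\lambda(i)) \right) - \delta_{\lambda\lambda(i)}\, q_i(\mu)
\end{aligned} \]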



As we can notice, the non-zero values of $C_{ii}$ are on the main diagonal and
on the row $C_{ii}(\lambda, \mu)$ with $\lambda = \lambda(i)$. Thus the eigenvalues of $C_{ii}$ are the elements
on the main diagonal. These are:

\[ \begin{cases} q_i(\lambda) - q_i(\lambda(i)) & \text{for } \lambda \ne \lambda(i), \\ -q_i(\lambda(i)) & \text{otherwise.} \end{cases} \tag{9} \]

Since p is strictly consistent, $q_i(\lambda) < q_i(\lambda(i))$ so all the eigenvalues are real and
negative and not lower than $-q_i(\lambda(i))$. This tells us that the assignment is a
sink, and hence an asymptotically stable point for the dynamical system.

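We now prove the theorem for the normalized model. The fundamental steps
to follow are the same as for the standard model; we mainly have to derive the
new values for the partial derivatives:

\[ \begin{aligned}
\frac{\partial F_i(\lambda)}{\partial p_j(\mu)}(p) &= \frac{\delta_{ij}\delta_{\lambda\mu}\, q_i(\lambda) + p_i(\lambda) \frac{\partial q_i(\lambda)}{\partial p_j(\mu)}}{\sum_{\rho} p_i(\rho) q_i(\rho)} - \frac{p_i(\lambda) q_i(\lambda) \left( \delta_{ij}\, q_i(\mu) + \sum_{\rho} p_i(\rho) \frac{\partial q_i(\rho)}{\partial p_j(\mu)} \right)}{\left( \sum_{\rho} p_i(\rho) q_i(\rho) \right)^2} - \delta_{ij}\delta_{\lambda\mu} \\
&= \delta_{ij}\delta_{\lambda\mu} \frac{q_i(\lambda)}{q_i(\lambda(i))} + \delta_{\lambda\lambda(i)} \frac{1}{q_i(\lambda(i))} \frac{\partial q_i(\lambda(i))}{\partial p_j(\mu)} - \delta_{ij}\delta_{\lambda\lambda(i)} \frac{q_i(\lambda) q_i(\mu)}{q_i(\lambda(i))^2} - \delta_{\lambda\lambda(i)} \frac{q_i(\lambda)}{q_i(\lambda(i))^2} \frac{\partial q_i(\lambda(i))}{\partial p_j(\mu)} - \delta_{ij}\delta_{\lambda\mu} \\
&= \delta_{ij}\delta_{\lambda\mu} \frac{q_i(\lambda)}{q_i(\lambda(i))} - \delta_{ij}\delta_{\lambda\lambda(i)} \frac{q_i(\lambda) q_i(\mu)}{q_i(\lambda(i))^2} - \delta_{ij}\delta_{\lambda\mu}
\end{aligned} \]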

As for the standard model, we have $C_{ij} = 0$ for $i \ne j$, and the matrices $C_{ii}$ are
non-zero only on the main diagonal and on the row related to the assignment
$\lambda(i)$. Once more, then, the eigenvalues are equal to the elements on the main
diagonal. These are:

\[ \begin{cases} \dfrac{q_i(\lambda) - q_i(\lambda(i))}{q_i(\lambda(i))} & \text{for } \lambda \ne \lambda(i), \\ -1 & \text{otherwise.} \end{cases} \tag{10} \]

Thus the eigenvalues are all real and negative and not lower than $-1$, i.e.,
strictly consistent assignments are sinks for system (6). □



The previous theorem is the analog to the fundamental local convergence 
result of Hummel and Zucker [8, Theorem 9.1], which is also valid for the classical
relaxation scheme (2) [12, Theorem 6.4]. Note that, unlike Theorem 2, no
restriction on the structure of the compatibility matrix is imposed here. 
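
The structure of the differential used in the proof can be checked by finite differences. The following is a sketch under stated assumptions: linear support, an arbitrary asymmetric random matrix stored as `r[i][j][lam][mu]`, and a hypothetical construction (adding 1 to $r_{ij}(\lambda(i), \lambda(j))$) that makes a chosen non-ambiguous labeling strictly consistent. It compares a numerical Jacobian of F with the predicted pattern: zero off-diagonal blocks, non-zero entries only on the diagonal and on the row $\lambda(i)$ of each diagonal block, and diagonal entries as in (9) and (10).

```python
import random

def support(r, p):
    """Linear support q_i(a) = sum_j sum_b r[i][j][a][b] * p[j][b]."""
    n, m = len(p), len(p[0])
    return [[sum(r[i][j][a][b] * p[j][b] for j in range(n) for b in range(m))
             for a in range(m)] for i in range(n)]

def field(r, p, normalized=False):
    """Right-hand side F of model (5), or of model (6) when normalized=True."""
    out = []
    for pi, qi in zip(p, support(r, p)):
        avg = sum(x * y for x, y in zip(pi, qi))
        scale = 1.0 / avg if normalized else 1.0
        out.append([x * (y - avg) * scale for x, y in zip(pi, qi)])
    return out

def jacobian(r, p, normalized, eps=1e-5):
    """Central differences: jac[(i, a, j, b)] approximates dF_i(a) / dp_j(b)."""
    n, m = len(p), len(p[0])
    jac = {}
    for j in range(n):
        for b in range(m):
            up = [row[:] for row in p]
            dn = [row[:] for row in p]
            up[j][b] += eps
            dn[j][b] -= eps
            fu, fd = field(r, up, normalized), field(r, dn, normalized)
            for i in range(n):
                for a in range(m):
                    jac[(i, a, j, b)] = (fu[i][a] - fd[i][a]) / (2 * eps)
    return jac

# Hypothetical problem in which the labeling lam(i) = i % m is strictly consistent:
# adding 1 to r_ij(lam(i), lam(j)) forces q_i(lam(i)) > q_i(lam) for lam != lam(i).
rng = random.Random(2)
n, m = 3, 3
label = [i % m for i in range(n)]
r = [[[[rng.random() + (1.0 if (a == label[i] and b == label[j]) else 0.0)
        for b in range(m)] for a in range(m)] for j in range(n)] for i in range(n)]
p = [[1.0 if a == label[i] else 0.0 for a in range(m)] for i in range(n)]
q = support(r, p)

def check(normalized):
    """Largest deviation of the numerical Jacobian from the predicted entries."""
    jac, worst = jacobian(r, p, normalized), 0.0
    for (i, a, j, b), value in jac.items():
        top = q[i][label[i]]
        if i == j and a == b:              # diagonal entry: eigenvalue (9) or (10)
            if a == label[i]:
                expected = -1.0 if normalized else -top
            else:
                expected = (q[i][a] - top) / (top if normalized else 1.0)
        elif i == j and a == label[i]:     # the single dense row: not checked here
            continue
        else:                              # every other entry should vanish
            expected = 0.0
        worst = max(worst, abs(value - expected))
    return worst

worst_standard, worst_normalized = check(False), check(True)
```

Because each diagonal block has non-zero entries only on its diagonal and on one row, its eigenvalues are its diagonal entries, so agreement on the diagonal is enough to confirm the spectrum for this example.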

5 Discretizing the Models 

In order to simulate the behavior of the models on a digital computer, we need 
to make them evolve in discrete rather than continuous time steps. Two well 
known techniques to approximate differential equations are the Euler method 
and the Runge-Kutta method. With the Euler method we have: 

\[ p_i^{t+1}(\lambda) = p_i^{t}(\lambda) + h F_i(\lambda)(p^{t}) \tag{11} \]

where h is the step size. This scheme is attractive because each update requires
only a single evaluation of F, so it is the natural candidate for our simulations. We will
prove that, for a suitably small integration step h, this model enjoys all the dynamical
properties shown for the continuous models it approximates.
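
For concreteness, the two discretization techniques named above can be sketched generically in Python. The functions below are an illustration, not the authors' code: they act on a flat state vector, take the vector field `F` as an argument, and use the classical fourth-order Runge-Kutta weights, which is an assumption about the particular Runge-Kutta variant. The toy field $dx/dt = -x$ is used only to check the arithmetic.

```python
def euler_step(F, p, h):
    """One Euler step (11): p + h * F(p), for a flat state vector p."""
    return [x + h * dx for x, dx in zip(p, F(p))]

def rk4_step(F, p, h):
    """One classical fourth-order Runge-Kutta step for dp/dt = F(p)."""
    k1 = [h * v for v in F(p)]
    k2 = [h * v for v in F([x + 0.5 * k for x, k in zip(p, k1)])]
    k3 = [h * v for v in F([x + 0.5 * k for x, k in zip(p, k2)])]
    k4 = [h * v for v in F([x + k for x, k in zip(p, k3)])]
    return [x + (a + 2 * b + 2 * c + d) / 6.0
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)]

# Toy check on dx/dt = -x, whose exact solution is x(t) = x(0) * exp(-t).
decay = lambda x: [-v for v in x]
euler_value = euler_step(decay, [1.0], 0.1)[0]   # 1 - 0.1
rk4_value = rk4_step(decay, [1.0], 0.1)[0]       # close to exp(-0.1)
```

The Runge-Kutta step costs four evaluations of F per iteration against one for Euler, which is the trade-off between cost and accuracy discussed in the text.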
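
In order to determine the difference in global behavior between the continuous
models and the discrete approximations, we also use a finer discretization model:
the fourth-order Runge-Kutta method. This has been done on the assumption that
this model would have a global dynamic behavior very similar to that of the
continuous models. We have chosen the following Runge-Kutta scheme:

\[ p_i^{t+1}(\lambda) = p_i^{t}(\lambda) + \frac{1}{6} k_1(i, \lambda) + \frac{1}{3} k_2(i, \lambda) + \frac{1}{3} k_3(i, \lambda) + \frac{1}{6} k_4(i, \lambda) \]

where the coefficients $k_1$, $k_2$, $k_3$, $k_4$ represent:

\[ \begin{cases}
k_1(i, \lambda) = h F_i(\lambda)(p) \\
k_2(i, \lambda) = h F_i(\lambda)(p + \frac{1}{2} k_1) \\
k_3(i, \lambda) = h F_i(\lambda)(p + \frac{1}{2} k_2) \\
k_4(i, \lambda) = h F_i(\lambda)(p + k_3)
\end{cases} \]

We will prove that the models discretized with Euler’s method are well defined,
that is, they map points in the assignment space K into K. Euler’s scheme
applied to our standard relaxation model (5) gives:

\[ p_i^{t+1}(\lambda) = p_i^{t}(\lambda) + h p_i^{t}(\lambda) \left[ q_i^{t}(\lambda) - \sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu) \right] \]

We note that when h equals 1 the process is identical to the one recently proposed
by Chen and Luh [2,3]. Their model imposes strict constraints on the
compatibility coefficients to ensure that K be invariant with respect to iterations
of the process. However, it can be proven that, if an appropriate integration step
h is chosen, it is not necessary to impose such constraints.

It is easy to prove that $\sum_{\lambda} p_i^{t+1}(\lambda)$ always equals 1:

\[ \sum_{\lambda} p_i^{t+1}(\lambda) = \sum_{\lambda} p_i^{t}(\lambda) + h \sum_{\lambda} p_i^{t}(\lambda) q_i^{t}(\lambda) - h \sum_{\lambda} p_i^{t}(\lambda) \sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu) = 1 \]

But we have to prove that the iteration of the process never leads to negative
assignments.

Proposition 2. Let $h \le 1/q_i(\lambda; p)$ for all $i, \lambda, p$. Denoting by E the function
generated applying Euler’s scheme to the model (5), then for all $p \in K$, we have

\[ E_i(\lambda)(p) \ge 0. \]

Proof: We have:

\[ \begin{aligned}
E_i(\lambda)(p^{t}) &= p_i^{t}(\lambda) + h p_i^{t}(\lambda) \left( q_i^{t}(\lambda) - \sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu) \right) \\
&\ge p_i^{t}(\lambda) - h p_i^{t}(\lambda) \frac{1}{h} = 0
\end{aligned} \]

where the inequality uses $q_i^{t}(\lambda) \ge 0$ and $\sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu) \le 1/h$,
which proves the proposition.

□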



If we use the linear support function (1), the integration step can be:

\[ h = \frac{1}{\max_{i\lambda} \sum_{j} \max_{\mu} |r_{ij}(\lambda, \mu)|} \]
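
A small numerical experiment illustrates why a step of this kind keeps the Euler iteration inside K. The sketch assumes the linear support, non-negative compatibilities stored as `r[i][j][lam][mu]`, and a step $h = 1 / \max_{i\lambda} \sum_j \max_{\mu} |r_{ij}(\lambda, \mu)|$, which bounds $1/q_i(\lambda)$ on K; the function names and random data are hypothetical.

```python
import random

def support(r, p):
    """Linear support q_i(a) = sum_j sum_b r[i][j][a][b] * p[j][b]."""
    n, m = len(p), len(p[0])
    return [[sum(r[i][j][a][b] * p[j][b] for j in range(n) for b in range(m))
             for a in range(m)] for i in range(n)]

def step_bound(r):
    """h = 1 / max_{i,a} sum_j max_b |r[i][j][a][b]|, so that h * q_i(a) <= 1 on K."""
    n, m = len(r), len(r[0][0])
    return 1.0 / max(sum(max(abs(v) for v in r[i][j][a]) for j in range(n))
                     for i in range(n) for a in range(m))

def euler_standard(r, p, h):
    """Euler step of model (5): p_i(a) * (1 + h * (q_i(a) - sum_b p_i(b) q_i(b)))."""
    out = []
    for pi, qi in zip(p, support(r, p)):
        avg = sum(x * y for x, y in zip(pi, qi))
        out.append([x * (1.0 + h * (y - avg)) for x, y in zip(pi, qi)])
    return out

rng = random.Random(3)
n, m = 4, 3
r = [[[[rng.random() for _ in range(m)] for _ in range(m)]
      for _ in range(n)] for _ in range(n)]
p = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(n)]
p = [[v / sum(row) for v in row] for row in p]

h = step_bound(r)
for _ in range(200):
    p = euler_standard(r, p, h)

smallest = min(min(row) for row in p)                      # never negative
worst_sum_error = max(abs(sum(row) - 1.0) for row in p)    # rows still sum to 1
```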

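It can readily be seen that this model also corrects deviation from the assignment
space, provided that $p_i^{t}(\lambda) > 0$. In fact, given $\sum_{\lambda} p_i^{t}(\lambda) = 1 + \varepsilon$, we
have:

\[ \begin{aligned}
\sum_{\lambda} p_i^{t+1}(\lambda) &= \sum_{\lambda} p_i^{t}(\lambda) + h \sum_{\lambda} p_i^{t}(\lambda) \left( q_i^{t}(\lambda) - \sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu) \right) \\
&= (1 + \varepsilon) + h \left( 1 - (1 + \varepsilon) \right) \sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu) \\
&= 1 + \varepsilon - \varepsilon h \sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu)
\end{aligned} \]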






As far as the normalized model is concerned, Euler’s scheme yields:

\[ p_i^{t+1}(\lambda) = (1 - h)\, p_i^{t}(\lambda) + h\, \frac{p_i^{t}(\lambda) q_i^{t}(\lambda)}{\sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu)} \]



As can easily be seen, with h = 1, this is the same equation that defines the 
classical model. Thus for h = 1 the model is well defined. 

With an h lower than 1 the resulting assignment is a convex linear combina- 
tion of p and the assignment resulting from applying one iteration of the classical 
method to p. Since the assignment space K is convex, the resulting assignment 
will also be in K. 

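We can see that this model is also numerically stable. In fact, with h = 1, if
we have $p_i(\lambda) > 0$, the model corrects any deviation from K in one step. On the
other hand, with h < 1, if $\sum_{\lambda} p_i^{t}(\lambda) = 1 + \varepsilon$, we have:

\[ \sum_{\lambda} p_i^{t+1}(\lambda) = \sum_{\lambda} (1 - h)\, p_i^{t}(\lambda) + \sum_{\lambda} h\, \frac{p_i^{t}(\lambda) q_i^{t}(\lambda)}{\sum_{\mu} p_i^{t}(\mu) q_i^{t}(\mu)} = (1 - h)(1 + \varepsilon) + h = 1 + (1 - h)\varepsilon \]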

That is, the iteration of the model reduces the deviation from K at every step. 
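
The correction of deviations by the normalized scheme is easy to reproduce. In the sketch below the compatibility values, the layout `r[i][j][lam][mu]`, and the function names are made up for illustration, and the linear support is assumed. Starting from rows that sum to $1 + \varepsilon$, one step with $h = 0.5$ should leave rows summing to $1 + (1 - h)\varepsilon$, and one step with $h = 1$ should restore rows summing to 1.

```python
def support(r, p):
    """Linear support q_i(a) = sum_j sum_b r[i][j][a][b] * p[j][b]."""
    n, m = len(p), len(p[0])
    return [[sum(r[i][j][a][b] * p[j][b] for j in range(n) for b in range(m))
             for a in range(m)] for i in range(n)]

def euler_normalized(r, p, h):
    """Euler step of model (6): (1 - h) * p_i(a) + h * p_i(a) q_i(a) / sum_b p_i(b) q_i(b)."""
    out = []
    for pi, qi in zip(p, support(r, p)):
        avg = sum(x * y for x, y in zip(pi, qi))
        out.append([(1.0 - h) * x + h * x * y / avg for x, y in zip(pi, qi)])
    return out

# Hypothetical two-object, two-label problem with positive compatibilities.
r = [[[[0.9, 0.2], [0.3, 0.7]], [[0.5, 0.4], [0.1, 0.8]]],
     [[[0.6, 0.3], [0.2, 0.9]], [[0.7, 0.1], [0.4, 0.5]]]]
eps = 0.2
# Rows deliberately sum to 1 + eps instead of 1.
p = [[0.6 * (1 + eps), 0.4 * (1 + eps)], [0.3 * (1 + eps), 0.7 * (1 + eps)]]

half = euler_normalized(r, p, h=0.5)   # deviation shrinks to (1 - h) * eps
full = euler_normalized(r, p, h=1.0)   # classical update: deviation removed at once
half_sums = [sum(row) for row in half]
full_sums = [sum(row) for row in full]
```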

It is easy to prove that strictly consistent assignments are local attractors for
these discrete models. In order to do this we must note that the differential of E
is $I + h\,DF$; so, given an eigenvalue a of DF, there is an eigenvalue of DE equal
to $1 + ha$. Furthermore, this property defines all eigenvalues of DE. As we have
seen in (9), the eigenvalues of DF calculated for the standard model are all not
lower than $-\max_i q_i(\lambda(i))$ and all strictly lower than 0; so, for any integration
step lower than $1/q_i(\lambda)$ for all i and all $\lambda$, we have, for any eigenvalue b of DE,
$b = 1 + ha > 1 - h \frac{1}{h} = 0$ and $b = 1 + ha < 1$. Thus, strictly consistent assignments
are hyperbolic attractors for the system [5]. The eigenvalues of DF calculated
for the normalized model are all not lower than $-1$ and all strictly lower than
0 (10); so, for $h < 1$ the eigenvalues of DE are all not lower than 0 and all
strictly lower than 1. Thus, in this case as well, strictly consistent assignments
are hyperbolic attractors for the system.



6 Experimental Results 

In order to evaluate the practical behavior of the proposed models we conducted 
two series of experiments. Our goal was to verify that the models exhibit the same 
dynamical behavior as the classical relaxation scheme (2). The experiments were 
conducted using both the Euler and the Runge-Kutta discretizations described 
in the previous section. 

The first set of simulations was conducted over the classical “triangle” problem
introduced as a toy example in the seminal paper by Rosenfeld, Hummel and
Zucker [15]. The problem is to label the edges of a triangle as convex, concave,
right- or left-occluding. Here, only eight labelings are possible (see [15]
for details). The compatibility coefficients used were the same as those given
in [15]. As a first control we verified whether the models’ behaviors differ, start- 
ing from the eight initial assignments given in [15]. From these starting points 
all the models gave the same sets of classifications. After this preliminary test, 
we generated 100 random assignments and used them as starting points for each 
model. The iterations were stopped when the sum of Kullback’s I-directed diver- 
gence between two successive assignments was lower than 10*^^. All the models 
converged to a non-ambiguous assignment. Moreover, the Euler discretizations
of our dynamics gave the same results as the classical model for all initial as- 
signments, while the Runge-Kutta discretizations gave a different result only 
for one initial assignment. This single assignment was reached with the highest 
number of iterations of all the assignments generated. This is probably due to 
the symmetry of the problem: a similar problem can be seen with a uniform 
probability distribution among assignments. The iteration of each model should 
converge to the a priori probability of each classification, that is 3/8 for each 
occluding edge and 1/8 for convex or concave edges. What really happens is that 
the assignments start by heading towards the a priori distribution, but, after a 
few iterations, they head towards a non-ambiguous assignment. This happens
because the a priori probability is not a hyperbolic attractor for the system. It
is possible that a similar problem affected the only initial assignment that gave
different results: the models headed toward different non-ambiguous solutions
from a unique non-hyperbolic equilibrium that separates the orbits. The average
number of iterations that the models needed to reach the stopping criterion is 
shown in Table 1. 
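
The stopping rule described above can be sketched as follows. This is an illustration under assumptions: the divergence is taken from the new assignment relative to the old one, summed over objects, and the tolerance passed to `iterate_until_stable` is a placeholder rather than the threshold used in the experiments. The toy update at the end is not a relaxation scheme; it only exercises the loop.

```python
import math

def divergence_sum(p_old, p_new):
    """Sum over objects of the I-directed divergence sum_lam p_new * log(p_new / p_old).

    Terms with p_new == 0 contribute 0. The direction (new relative to old) is an
    assumption, and p_old is expected to be positive wherever p_new is positive.
    """
    total = 0.0
    for old_row, new_row in zip(p_old, p_new):
        for old, new in zip(old_row, new_row):
            if new > 0.0:
                total += new * math.log(new / old)
    return total

def iterate_until_stable(step, p, tol, max_iter=10000):
    """Apply `step` until two successive assignments differ by less than `tol`."""
    for count in range(1, max_iter + 1):
        p_next = step(p)
        if divergence_sum(p, p_next) < tol:
            return p_next, count
        p = p_next
    return p, max_iter

# Toy update (not a relaxation scheme): move halfway towards a fixed target.
target = [[0.2, 0.8]]
halfway = lambda p: [[0.5 * (x + t) for x, t in zip(row, trow)]
                     for row, trow in zip(p, target)]
final, iterations = iterate_until_stable(halfway, [[0.5, 0.5]], tol=1e-12)
```

Any of the discretized update rules can be passed as `step`, which makes it straightforward to count iterations for each model under the same criterion.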

Model                                              Iterations
classical (eq. (2))                                79.1
standard, discretized with Euler’s scheme          118.8
normalized, discretized with Runge-Kutta scheme    81.4
standard, discretized with Runge-Kutta scheme      87.2

Table 1. Average number of iterations for the triangle labeling problem.



The second set of simulations was carried out by generating random sets of 
(asymmetric) compatibility matrices. This is the set of tests which most effectively
points out differences in the dynamic behavior of the models. Since there
was no underlying scheme on the pattern of compatibility coefficients, we do 
not expect the models to converge to a non-ambiguous assignment each time. 
In fact, in our experiments, the classical model converged to a non-ambiguous 
assignment only 22% of the time. The aim of this set of tests was to verify 
whether, when the classical model converges to a non-ambiguous assignment, 
the other models converge to the same assignment. Ten random coefficient matrices
were generated for this experiment and for each matrix the various models
were started from ten random assignments. Hence we made a hundred tests for
each model. The assignment space dimension was five objects (n = 5) and three
labels (m = 3). The stopping criterion was the same as in the previous set of experiments.
Here, the classical model (2) converged to a non-ambiguous assignment
22 times. The Runge-Kutta discretization of both models converged to the same 
assignments 20 times, while Euler discretization of the standard model reported 
the same assignments 17 times. Table 2 reports the average number of iterations 
needed to reach the stopping criterion. 



Model                                              Iterations
classical (eq. (2))                                294.0
standard, discretized with Euler’s scheme          643.1
normalized, discretized with Runge-Kutta scheme    260.1
standard, discretized with Runge-Kutta scheme      324.5

Table 2. Average number of iterations for the random compatibility experiments.



7 Conclusions 

In this paper we have presented and analyzed two relaxation labeling processes. 
In contrast with the standard approach, these models evolve through continuous-time
rather than discrete-time dynamics. This fact permits the design of analog
hardware implementations, makes the study of their properties simpler and more
elegant, and makes the models more plausible biologically. We have analyzed
the dynamical behavior of the models and shown how it is intimately related
to Hummel and Zucker’s classical theory of consistency. We have proven that
the models enjoy exactly the same dynamical properties which have already 
been proven for the classical processes. The dynamics of the models discretized 
through Euler’s scheme has also been studied. Experimental results confirm the 
validity of the proposed models. 

Abstract. Accurate localization and tracking of facial features are crucial
for developing high quality model-based coding (MPEG-4) systems.
For teleconferencing applications at very low bit rates, it is necessary
to track eye and lip movements accurately over time. These movements 
can be coded and transmitted to a remote site, where animation tech- 
niques can be used to synthesize facial movements on a model of a face. 
In this paper we describe the integration of simple heuristics which are 
effective in improving the results of well-known facial feature detection 
with robust techniques for adapting a dynamic mesh for animation. A 
new method of generating a self-adaptive mesh using an extended dy- 
namic mesh (EDM) is proposed to overcome the convergence problem 
of the dynamic-motion-equation method (DMM). The new method con- 
sisting of two-step mesh adaptation (called coarse-to-fine adaptation) 
can enhance the stability of the DMM and improve the performance of 
the adaptive process. The accuracy of the proposed approach is demon- 
strated by experiments on eye model animation. In this paper, we focus 
our discussion only on the detection, tracking, modeling and animation 
of eye movements. 



1 Introduction 

From the image analysis point of view, images can be considered as having 
structural features or objects such as contours and regions. These image fea- 
tures or objects have been exploited to encode images at very low bit rates. 
Research on this approach, known as model-based coding, which is related to 
both image analysis and computer graphics, has recently intensified. Up to now, 
most of the contributions to 3D model-based coding have focused on human 
facial images. Although a number of schemes for model-based coding have been
proposed [4,13], automatic facial feature detection and tracking, along with facial
expression analysis and synthesis, still pose a major challenge, in particular
finding accurate features and their motion.

A variety of approaches have been proposed for detection of facial features. 

* This work is supported in part by the Canadian Natural Sciences and Engineering 
Research Council. 

E.R. Hancock and M. Pelillo (Eds.): EMMCVPR’99, LNCS 1654, pp. 269-284, 1999. 
These include deformable template matching [8,24,16], Hough transforms [5],
and color image processing [3,21]. Matching deformable templates requires a
fairly accurate initial localization of the template because the energy mini- 
mization process only finds a local minimum. Other problems are caused by 
using several energy terms and weighting factors during the different epochs of 
matching. Because of the definition of the energy terms in [24] the template
also tends to shrink. In this paper we overcome some of these difficulties by
improving the initial localization process. We show that simple processing on 
color images coupled with Hough transform and deformable template matching 
can produce very accurate results. 

Another important component in model-based coding is synthesizing facial 
movements and expression at a remote site using the motion parameters de- 
tected on an actual face image and animation on a model of this face. To 
represent a facial expression, several approaches have been proposed relying on 
feature detection [2,24], facial motion and expression analysis [4,22,9], and facial 
expression synthesis [19,4]. However, little work has been done specifically on ac- 
curate eye expression synthesis. Because the eyes are one of the most significant 
organs contributing to a vivid face expression, subtle changes in eye movements 
can result in a different expression. Therefore accurate eye expression analysis 
and synthesis are necessary. Recently, face animation methods have been proposed
in [10,14]; however, these techniques need to establish correspondences between
dots marked on faces to resolve the feature detection issue.

Fig. 1. An example of an MPEG-4 (SNHC) implementation. Recoverable block labels: motion estimation, expression analysis, image generation, expression synthesis.

In this paper, we present an approach to synthesize eye movements by using
the extracted eye features and the extended dynamic mesh (called extended
adaptive mesh) to compute the deformation of the eyes in a 3D model. After 
creating an individualized 3D face model [23], we map this model to the first 
frame of the face sequence. Based on the extracted eye features we apply the 
deformation parameters to the 3D model to synthesize eye movements in suc- 
cessive frames. The work described here is related to the emerging MPEG-4 
guidelines for developing very low bit rate coding systems [20]. The Synthetic
Natural Hybrid Coding (SNHC) component of MPEG-4 proposes that features 
and movement of features be detected and coded at a server site; these codes 
be transmitted following certain standards; and finally the codes be received at 
a client site and used for animation on a model to graphically emulate the real 
scene. Figure 1 shows an example of an MPEG-4 (SNHC) multimedia system.
It should be noted that MPEG-4 does not specify the techniques to be used for 
feature detection to realize an actual implementation; this allows researchers 
(such as our group) to investigate alternative techniques as outlined in this pa- 
per. The MPEG-4 animation guideline outlines animation of feature points and 
suggests interpolation for non-feature points. We demonstrate that an extended 
adaptive mesh produces much more accurate and realistic animation compared 
to existing methods ([16]-[20], [13]).

The remainder of this paper is organized as follows: In Section 2, we describe the 
approach proposed for accurate eye feature detection and tracking. In Section 3, 
an extended adaptive mesh technique for eye model adaptation and animation 
is described. Experimental results are shown in Section 4. Finally, concluding 
remarks are given in Section 5. 

2 Robust Eye Feature Detection 

Our approach to detecting the eyes is similar to [5] in that it uses the Hough transform
and deformable template matching; however, our approach also exploits
color information to extract the eyes accurately. The algorithm can be outlined 
as follows: 

— Determine two coarse regions of interest for the eyes. 

— Search the iris of the eyes using a gradient based Hough transform. 

— Determine a fine region of interest for extracting the boundaries of the eyes. 

— Using color information get an initial approximation for the eye lids. 

— Localize the eye lids using deformable templates. 

After detecting the face region two coarse regions of interest in the upper left 
and upper right half of an image can be defined to detect the eyes. Also a 
coarse range for the size of the eyes can be derived [1]. Since the iris is the most 
significant feature of the eye and has a simple circular shape, it is detected first 
by using a gradient-based Hough transform for circles [6,12]. 
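
The idea of a gradient-based Hough transform for circles can be sketched in pure Python. This is a simplified illustration, not the detector used by the authors: it assumes a dark disc on a brighter background (so the centre lies against the gradient direction), a relative threshold on the Sobel magnitude, and integer radii; the function names and the synthetic image are hypothetical.

```python
import math

def sobel(img):
    """Sobel gradients (gx, gy) at the interior pixels of a 2-D list image."""
    h, w = len(img), len(img[0])
    gx = [[0.0] * w for _ in range(h)]
    gy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx[y][x] = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                        - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy[y][x] = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                        - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
    return gx, gy

def hough_circle(img, r_min, r_max, rel_threshold=0.5):
    """Gradient-based circle Hough transform for a dark disc on a bright background.

    Each strong edge pixel votes, once per candidate radius, for the centre that
    lies at that distance against its gradient direction. Returns (cx, cy, radius).
    """
    gx, gy = sobel(img)
    h, w = len(img), len(img[0])
    mag = [[math.hypot(gx[y][x], gy[y][x]) for x in range(w)] for y in range(h)]
    cutoff = rel_threshold * max(max(row) for row in mag)
    votes = {}
    for y in range(h):
        for x in range(w):
            if mag[y][x] < cutoff or mag[y][x] == 0.0:
                continue
            ux, uy = gx[y][x] / mag[y][x], gy[y][x] / mag[y][x]
            for radius in range(r_min, r_max + 1):
                key = (round(x - radius * ux), round(y - radius * uy), radius)
                votes[key] = votes.get(key, 0) + 1
    return max(votes, key=votes.get)

# Synthetic test image: dark disc (the "iris") of radius 8 centred at (20, 18).
image = [[0.0 if (x - 20) ** 2 + (y - 18) ** 2 <= 64 else 1.0 for x in range(40)]
         for y in range(40)]
cx, cy, radius = hough_circle(image, r_min=5, r_max=12)
```

Using the gradient direction means each edge pixel casts one vote per radius instead of a full circle of votes, which keeps the accumulator sparse; restricting `r_min` and `r_max` plays the role of the coarse size range mentioned in the text.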

Information on the position, magnitude and direction of significant edges is 
extracted by convolving the intensity image with a Sobel kernel. Figure 2(a) 
shows an example of the robustness of the Hough transform which extracts the 
iris circle. After extracting the circles the deformable templates for the eye lids 
have to be initialized. The template along with all the parameters are shown in 
Figure 3 with the parameters being set as follows: 

h\ — \ h /2 — 0 w — 2i.2iTi'pig (l) 

Fig. 2. (a)-(d) (left-right): (a) Extracted circle using gradient-based Hough transform;
Image fields used for computing the potential energy: (b) image, (c) saturation,
(d) edge.



where $r_{iris}$ is the radius of the extracted circle. The orientation $\alpha$ is determined
by the center points of the two circles. The initialized deformable template is
also shown in Figure 3.

Fig. 3. Deformable Template (model, initialization). Legend: w, half-width; $h_1$, upper height; $h_2$, lower height; $\alpha$, orientation; $(x_c, y_c)$, center point.

For the matching of the deformable template two different
types of image information are used to create potential fields, one in each
epoch: i.e., saturation information and edge information. Examples of the image
information extracted from a typical eye are shown in Figure 2(c) and (d). The first
step in localizing the eye lids is to approximate the position of the deformable
template relative to the iris. This is done by minimizing the following energy
($E_{sat}$), which is similar to the valley energy in [24]:

\[ E_{sat} = -\frac{1}{|A_w|} \int_{A_w} \Phi_{sat}(x)\, dA \tag{2} \]

$A_w$ is the area inside the parabolas but not inside the circle of the iris and $\Phi_{sat}(x)$
is the inverted saturation value of the color image. Since only the location (not
the size) is changed this method does not have the shrinking effect. The next
step is to estimate the parameters $h_1$ and $h_2$. Two regions of interest on both
sides of the iris are defined (see Figure 4). The parameters of these regions are
set as follows:



'^ROI = 5 hjiOI = SViris 



dnoi = iris + 5 



(3) 

Depending on the position of the iris inside the deformable template only the
left or the right region of interest is used for further computation. By using the
horizontal integral projection [2], and by detecting the two most significant
opposite gradients in the projection, the position of a point on the upper and
lower eye lid can be detected (see Figure 4). The parameters $h_1$ and $h_2$ of the
template are updated as follows:

7 7 iVup ~ Vcl 7 7 \yiOW ~ Vcl / .X 

fii — fii -j^ ri2 — ti2 -j^ 

$y_{up}$ and $y_{low}$ are the y-coordinates of the detected points inside the region of
interest of the upper and lower eye lids, and are the heights of the actual 
parabolas inside the region of interest. 
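
The integral projection step can be illustrated with a short sketch. It is only one plausible reading of "the two most significant opposite gradients": the largest positive and the largest negative difference between consecutive entries of the row-sum profile. The function names and the synthetic region (a bright band between darker lids) are assumptions.

```python
def horizontal_projection(region):
    """Horizontal integral projection: the sum of intensities along each row."""
    return [sum(row) for row in region]

def eyelid_rows(region):
    """Rows bounding the strongest rising and falling steps of the projection.

    Returns (y_up, y_low) with y_up < y_low; each value is the index of the row
    just before the step. Picking the two extreme differences is an assumption.
    """
    proj = horizontal_projection(region)
    diff = [b - a for a, b in zip(proj, proj[1:])]
    rising = max(range(len(diff)), key=diff.__getitem__)
    falling = min(range(len(diff)), key=diff.__getitem__)
    return min(rising, falling), max(rising, falling)

# Synthetic region beside the iris: bright sclera (rows 5-9) between darker lids.
region = [[200 if 5 <= y <= 9 else 100 for _ in range(6)] for y in range(15)]
y_up, y_low = eyelid_rows(region)
```

On real images the projection would normally be smoothed before differencing, since single noisy rows can otherwise produce spurious extreme gradients.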

Fig. 4. Integral projections beside iris (template and example). Legend: $W_{ROI}$, width; $H_{ROI}$, height; d, distance.

The last step is to match the deformable template accurately to the eye
lids by minimizing the following energy ($E_{edge}$):

\[ E_{edge} = -\frac{1}{|B_w|} \int_{B_w} \Phi_{edge}(x)\, ds \tag{5} \]

$B_w$ is the boundary of the parabolas and $\Phi_{edge}(x)$ is the edge magnitude. During
this minimization every parameter of the deformable template (location,
orientation, height, width) can be changed.



The tracking of eye features is similar to detection of eye features with the 
following differences: 

— The region of interest as well as the possible size and therefore the Hough 
space for the extraction of the iris can be restricted by using the position 
and size of the eye extracted in the previous frame. 

— Instead of using the deformable template, the matched template of the pre- 
vious frame is used for the initialization of the new template. 

3 Eye Model Adaptation and Animation with Extended
Adaptive Mesh 

The eye detection and tracking algorithms extract the contours of the iris and eye lids in each frame of the image sequence. These eye features are used to synthesize the real motions of the eye on a 3D facial model. To make the eye animation realistic, we apply an extended adaptive mesh approach (i.e., extended dynamic mesh (EDM)), instead of the interpolation method [1], to animate the eye movement. The adaptive mesh (i.e., dynamic mesh (DM)) is a well-known approach for adaptive sampling of images [17,18] and physically based modeling of non-rigid objects [19,11,16]. Previous work [17,18,19,11] has shown that this technique underlies many powerful approaches in computer vision and computer graphics. The adaptive mesh is assembled from nodal points connected by adjustable springs. The fundamental equation [17] is a non-linear second-order differential equation, which can be written as:

$$m_i \frac{d^2 x_i}{dt^2} + \gamma_i \frac{d x_i}{dt} + g_i = f_i, \qquad i = 1, \ldots, N \tag{6}$$

where $x_i$ is the position of node $i$, $m_i$ is the point mass of node $i$, $\gamma_i$ is the damping coefficient dissipating kinetic energy in the mesh through friction, $f_i$ is the external force acting on node $i$, and $g_i$ is the internal force on node $i$ due to the springs connected to neighboring nodes $j$.

To simulate the dynamics of the deformable mesh, the equations of motion are numerically integrated forward through time until the mesh is nearly stabilized. Although a number of numerical methods have been used to solve this equation (e.g., the Euler method and the Runge-Kutta method [15,19]), stability is still the main concern in achieving a satisfactory solution. For example, when a node moves across an image feature boundary associated with an abrupt change in image intensity, the stiffness of the springs connected to the node changes rapidly and may reverse the nodal force, which can lead to perpetual oscillation of the node. In such situations a new equilibrium state cannot be reached. To make the mesh converge to a stable state, $m_i$ and $\gamma_i$ must be carefully chosen. Overdamped behavior (i.e., large values of $m_i$ and $\gamma_i$) enhances the stability of the numerical simulation, but at the expense of the accuracy of the solution. To make the solution more stable and accurate, we extend the conventional dynamic mesh method [17] by introducing a so-called “energy-oriented mesh” (EOM) to refine adaptive meshes. The major differences between the conventional dynamic mesh (DM) and the EOM are: (1) EOM moves the mesh in the direction of decreasing mesh energy instead of decreasing the node velocities and accelerations; (2) EOM checks the node energy in each motion step without considering the velocity, so it is independent of the DM and can serve as a supplemental step to DM for stabilizing mesh movements. Therefore, our model adaptation procedure consists of two major steps: (1) coarse adaptation, which applies the DM method to make the large movement converge quickly to the region of an object; (2) fine adaptation, which applies the EOM method to finely adjust the mesh obtained after the first step and make the adaptation more “tight” and “accurate”.
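The forward integration of Equation (6) mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: there is no external force, all springs share one stiffness and one natural length, the spring force pulls a node toward a neighbour when the spring is stretched, and the names `spring_force`, `euler_relax`, `dt`, and `tol` are hypothetical.

```python
import math

def spring_force(pos, i, neighbors, stiffness, rest_len):
    """Net Hooke force on node i; it pulls toward a neighbour when the spring is stretched."""
    fx = fy = 0.0
    xi, yi = pos[i]
    for j in neighbors[i]:
        rx, ry = pos[j][0] - xi, pos[j][1] - yi
        length = math.hypot(rx, ry)
        if length == 0.0:
            continue  # coincident nodes: direction undefined, skip this spring
        stretch = length - rest_len
        fx += stiffness * stretch * rx / length
        fy += stiffness * stretch * ry / length
    return fx, fy

def euler_relax(pos, neighbors, fixed, mass=1.2, damping=1.2, stiffness=1.0,
                rest_len=1.0, dt=0.05, tol=1e-6, max_steps=20000):
    """Explicit Euler relaxation of a damped spring mesh with some nodes held fixed."""
    pos = [tuple(p) for p in pos]
    vel = [(0.0, 0.0)] * len(pos)
    steps = 0
    for steps in range(1, max_steps + 1):
        new_pos, new_vel = list(pos), list(vel)
        largest = 0.0
        for i in range(len(pos)):
            if i in fixed:
                continue  # boundary condition: fixed nodes never move
            fx, fy = spring_force(pos, i, neighbors, stiffness, rest_len)
            # Acceleration from the equation of motion with zero external force.
            ax = (fx - damping * vel[i][0]) / mass
            ay = (fy - damping * vel[i][1]) / mass
            vx = vel[i][0] + dt * ax
            vy = vel[i][1] + dt * ay
            new_vel[i] = (vx, vy)
            new_pos[i] = (pos[i][0] + dt * vx, pos[i][1] + dt * vy)
            largest = max(largest, math.hypot(vx, vy), math.hypot(ax, ay))
        pos, vel = new_pos, new_vel
        if largest < tol:
            break  # velocities and accelerations are below the threshold
    return pos, steps
```

For instance, a single free node between two fixed nodes placed three units apart settles at their midpoint. The stopping rule mirrors the text: integration ends once both velocity and acceleration magnitudes fall below a threshold.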



3.1 Principle of Energy Minimization in Energy Oriented Mesh 

According to the principle of minimum potential energy, of all possible kinematically admissible displacement configurations that an elastic body can take up, the configuration which satisfies equilibrium makes the total potential energy assume a minimum value [7]. The potential energy stated in this principle includes the strain energy and the potential energy formed by external forces. In our model, there is no external force and all the strain energy is stored as elastic energy in the springs. To reach the equilibrium state, the elastic energy in the springs has to be minimized by displacing nodes. If we let node $i$ move under a nodal force while all the neighboring nodes are fixed, the node will move in the direction of the nodal force, because the direction of steepest decrease of the total spring energy on node $i$ ($E_i$) coincides with the direction of the nodal force ($g_i$). This implies that for meshes associated with the image observations, if we let nodes move by successive steps based on the principle of minimum potential energy and reduce the strain energy at each step, we should finally obtain a fine adaptation on this image.

When a spring mesh is not in an equilibrium state, the nodes with non-zero forces acting on them tend to move in the direction of the resultant nodal forces. The movements of nodes reduce the energy caused by strain. When a final equilibrium state is reached, no further movements occur. To prevent a node from being over-displaced, the energy change at each step must be checked to ensure that the energy does not increase along the direction of movement.
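The energy check described above can be illustrated with a single free node attached to fixed neighbours. The sketch below is only an illustration under stated assumptions: springs are linear with energy 0.5 * c * (length - rest length)^2, the force is the one that pulls the node toward a neighbour when the spring is stretched, and the names `node_energy`, `node_force`, and `try_step` are hypothetical. A small step along the force lowers the energy and is kept, while an oversized step overshoots and is rejected.

```python
import math

def node_energy(p, anchors, stiffness=1.0, rest_len=1.0):
    """Elastic energy of the springs joining point p to the fixed anchors."""
    total = 0.0
    for a in anchors:
        stretch = math.hypot(a[0] - p[0], a[1] - p[1]) - rest_len
        total += 0.5 * stiffness * stretch * stretch
    return total

def node_force(p, anchors, stiffness=1.0, rest_len=1.0):
    """Resultant spring force on p, i.e. minus the gradient of node_energy."""
    fx = fy = 0.0
    for a in anchors:
        rx, ry = a[0] - p[0], a[1] - p[1]
        length = math.hypot(rx, ry)
        if length == 0.0:
            continue  # direction undefined for a zero-length spring
        stretch = length - rest_len
        fx += stiffness * stretch * rx / length
        fy += stiffness * stretch * ry / length
    return fx, fy

def try_step(p, anchors, step):
    """Move p by `step` times the force; keep the move only if the energy drops."""
    fx, fy = node_force(p, anchors)
    q = (p[0] + step * fx, p[1] + step * fy)
    if node_energy(q, anchors) < node_energy(p, anchors):
        return q, True
    return p, False  # over-displacement: stay in place
```

With anchors at (0, 0) and (3, 0) and a node at (0.5, 0.4), `try_step(p, anchors, 0.1)` is accepted, whereas `try_step(p, anchors, 2.0)` is rejected because the node would jump past the energy minimum.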



3.2 Detailed Algorithm 

Since eye movements can result in subtle expressions on the human face, the ac- 
curate and detailed movement of the adaptive model is highly desirable. There- 
fore, we use a detailed eye model (120 vertices for each eye) in order to achieve 
more realistic animation. Following is the main procedure for eye animation: 

1. Based on the 3D eye model which is part of our existing 3D facial model, 11 
feature vertices are defined for each eye as shown in Figure 5. Once the eye 
lid contour and the iris are detected in the image sequence, the corresponding 
feature points on the template can be also determined simply by computing 
the points on the boundary of the parabola and by using the center of the 
circle. 

2. To adapt the remaining vertices (i.e., non-feature vertices) onto the eye image, two steps are applied: (1) coarse adaptation (DM method), (2) fine adaptation (EOM method).

(a) Coarse adaptation:

Solve the dynamic motion equation (6) using the conventional explicit Euler time-integration procedure [15] until the motion parameters (velocity $v_i^t$ and acceleration $a_i^t$) are less than a certain threshold.

$$a_i^t = \frac{1}{m_i} B_i\left(f_i^t - \gamma_i v_i^t - g_i^t\right) \tag{7}$$

$$v_i^{t+\Delta t} = v_i^t + \Delta t\, a_i^t \tag{8}$$

$$x_i^{t+\Delta t} = x_i^t + \Delta t\, v_i^{t+\Delta t} \tag{9}$$

where $B_i$ denotes an operator whose role is to enforce boundary conditions or constraints. Equations (7), (8), and (9) are evaluated for all nodes, i.e., $i = 1, \ldots, N$, and consecutive time steps, i.e., $t = 0, \Delta t, 2\Delta t, \ldots$, until $v_i^t$ and $a_i^t$ are less than a certain threshold.

In our implementation of Equation (6) no external force is involved. The boundary vertices are fixed; they include the feature vertices defined on the eye model and the vertices on the border of Figure 5. Let node $i$ be connected to $M_i$ other nodes, i.e., node $i$ is attached to $M_i$ springs. The total internal force acting on node $i$ due to these springs is:

$$g_i = \sum_{j \in M_i} \cdots \tag{10}$$

where $r_{ij} = x_j - x_i$; $x_i$ and $x_j$ are the positions of nodes $i$ and $j$; $l_{ij}$ is the natural length of the spring connecting nodes $i$ and $j$; $\|r_{ij}\|$ is its actual length; and $c_{ij}$ is the stiffness of the spring $ij$.

Based on the nodal value (i.e., the intensity at the nodal position), the springs automatically adjust their stiffness so as to distribute the mesh in accordance with the local complexity of the image. Before calculating the stiffness, we apply a Sobel operator to obtain a gradient image, then normalize the intensity values of the gradient image to the range [0, 1]. Assuming that the stiffness of a spring changes linearly with the nodal values on the normalized gradient image, the calculation is as follows:

$$c_{ij} = k_2 I + k_1 \tag{11}$$

where $k_1$ is the pre-defined minimum stiffness of springs in the mesh; $k_1 + k_2$ is the maximum stiffness of springs; and $I$ is derived from the nodal values on the normalized gradient image. Unlike the stiffness calculation in [17], which takes the average of the two nodal values on a spring, we apply a weighted sum of nodal values as shown in Equation (12). This implies that the node closer to the feature vertices contributes more to the stiffness.

$$I = \frac{d_j}{d_i + d_j} S_i + \frac{d_i}{d_i + d_j} S_j \tag{12}$$

where $S_i$ and $S_j$ are the nodal values on the normalized gradient image at nodes $i$ and $j$ respectively, and $d_i$ (or $d_j$) is the minimum distance from node $i$ (or $j$) to the nearest vertex in the set of extracted feature vertices.



To obtain the nodal values ($S_i$, $S_j$), we use the conventional finite element concept to calculate the sub-pixel values in between the neighboring pixels. As shown in Figure 6, we split the pixel rectangle into two triangular elements so that sub-pixel values within a given triangle (a plane) vary linearly. The purpose of splitting into two triangular elements is to prevent a node from being over-displaced in the subsequent fine adaptation process (e.g., jumping across an edge boundary in one motion step). Let $A$ denote a pixel at position $(x_A, y_A)$ having value $I_A$. Four neighboring pixels $A$, $B$, $C$, $D$ are split into two triangular elements $\triangle ABC$ and $\triangle BCD$. The value of sub-pixel $p$ within $\triangle ABC$ can be obtained from Equation (13) (see [7]):

$$I_p = a_1 x_p + a_2 y_p + a_3 \tag{13}$$

where

$$a_1 = (x_C - x_B) I_A + (x_A - x_C) I_B + (x_B - x_A) I_C \tag{14}$$

$$a_2 = (y_B - y_C) I_A + (y_C - y_A) I_B + (y_A - y_B) I_C \tag{15}$$

$$a_3 = (x_B y_C - x_C y_B) I_A + (x_C y_A - x_A y_C) I_B + (x_A y_B - x_B y_A) I_C \tag{16}$$

Similar equations can be used if the point $p$ is within $\triangle BCD$.
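The two ingredients of the coarse adaptation step, sub-pixel values inside a triangular element and image-dependent spring stiffness, can be sketched as follows. This is a hedged illustration, not the authors' code: `plane_through` fits the plane through the three triangle vertices by solving the linear system directly (normalising by twice the signed triangle area), and `spring_stiffness` assumes, following the verbal description, that the stiffness rises linearly from k1 to k1 + k2 and that the node closer to a feature vertex receives the larger weight. All function names and default values are hypothetical.

```python
def plane_through(pa, pb, pc):
    """Coefficients (a1, a2, a3) of I = a1*x + a2*y + a3 through three (x, y, I) samples."""
    (xa, ya, ia), (xb, yb, ib), (xc, yc, ic) = pa, pb, pc
    det = (xb - xa) * (yc - ya) - (xc - xa) * (yb - ya)  # twice the signed area
    if det == 0:
        raise ValueError("degenerate triangle")
    a1 = ((ib - ia) * (yc - ya) - (ic - ia) * (yb - ya)) / det
    a2 = ((xb - xa) * (ic - ia) - (xc - xa) * (ib - ia)) / det
    a3 = ia - a1 * xa - a2 * ya
    return a1, a2, a3

def subpixel_value(p, pa, pb, pc):
    """Linearly interpolated value at point p = (x, y) inside the triangle pa, pb, pc."""
    a1, a2, a3 = plane_through(pa, pb, pc)
    return a1 * p[0] + a2 * p[1] + a3

def spring_stiffness(s_i, s_j, d_i, d_j, k1=1.0, k2=9.0):
    """Stiffness between k1 and k1 + k2 from a distance-weighted blend of nodal values.

    s_i, s_j: nodal values on the normalized gradient image, each in [0, 1].
    d_i, d_j: distances of nodes i and j to their nearest feature vertex.
    """
    total = d_i + d_j
    if total == 0:
        blended = 0.5 * (s_i + s_j)  # both nodes sit on feature vertices
    else:
        blended = (d_j * s_i + d_i * s_j) / total  # the closer node weighs more
    return k1 + k2 * blended
```

For a unit pixel triangle with values 10, 20, and 30 at (0, 0), (1, 0), and (0, 1), the interpolated value at (0.25, 0.5) is 22.5. Because the blend stays within [0, 1], the returned stiffness always lies between `k1` and `k1 + k2`.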




Fig. 6. Node displacement within a triangle area in each step: $n_1$: node start position; $n_2$: node end position; $n_3$: node end position in next step.

(b) Fine adaptation:

After a mesh stabilizes, fine adjustments can be made by the EOM method to make the mesh converge to the image “tightly”. Assuming that the image intensity changes continuously over the spatial domain, the grey values in between pixels can be obtained from Equation (13). The criterion for the fine movement of nodes is that only movements that decrease the node energy stored in the connected springs are allowed. The node energy calculation and the node motion rules are described below:

— Obtain node force. Of all the nodes on the mesh except the boundary vertices, find the node with the largest nodal force using Equation (10), e.g., $g_i$. Within a sub-pixel domain (triangle area), search sub-pixels along the direction of the $g_i$ vector in order to find the one having minimum node energy.

— Obtain node energy. The strain energy stored in a spring $ij$ is calculated as follows:

$$E_{ij} = \frac{1}{2} c_{ij} \left(\|r_{ij}\| - l_{ij}\right)^2 \tag{17}$$

Node energy $E_i$ is defined as the summation of the energy stored in all the springs connected to node $i$, i.e.,

$$E_i = \sum_{j \in M_i} E_{ij} \tag{18}$$

— Displacement of a node. The displacement of a node at a step is along the direction of the resultant nodal force. Theoretically, a node should move to a new position within the triangle domain where the node energy is a local minimum. To simplify the computation, in the current implementation, we calculate the node energy at three positions (i.e., the node start position (e.g., $n_1$), the middle position (e.g., $p$), and the node end position (e.g., $n_2$)). The position with the minimum node energy is the new position that the node is allowed to move to. So a displacement at a step stays within a triangular area (including the boundary lines AC and BC, for example, in Figure 6 from the node start position to the node end position). The maximum displacement of a node in one step will not exceed the distance between two adjacent pixels.

The rules for moving a node to a new position follow two conditions:

• Check the node energy at the start position and the end position. The energy must decrease; this ensures that the spring system has reached a state with less energy.

• To prevent the reversal of nodal force in the new position, the inner product of the force vector ($F_l$) at the node position with the force vector ($F_m$) at the new position must be greater than zero; this prevents the oscillating movement of nodes.

If the two conditions above are satisfied, the node is allowed to move to the new position. Otherwise, the node stays in its current place, and the procedure checks the node with the second largest force, repeats the above procedure, and continues until no node satisfies the above two conditions or the largest nodal force in the mesh is small enough (less than a certain threshold). Figure 7 shows a flowchart of the extended adaptive mesh algorithm in its current implementation.



[Figure 7 flowchart. Coarse adaptation (DM method): image with extracted feature template; obtain gradient image; normalization; boundary vertices fixed (feature vertices and border vertices); at each iteration step, parameter calculation for each node (node value, stiffness, nodal internal force, velocity, acceleration, displacement) until velocity and acceleration are below the threshold. Fine adaptation (EOM method): find the node having the largest nodal force among all the nodes; energy check in the direction of the nodal force; find the minimum energy position along the nodal force within the triangular sub-pixel area; end.]



Fig. 7. Schematic diagram of the extended adaptive mesh method. 
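The fine adaptation loop summarized in Figure 7 can be sketched as follows. This is a simplified illustration, not the authors' implementation: stiffness and natural length are uniform rather than image dependent, the triangular sub-pixel domain is replaced by a plain cap of one pixel on the step length, and the names `force_and_energy`, `eom_refine`, `max_step`, and `force_tol` are hypothetical. The sketch keeps the two acceptance rules of the text: the node energy must decrease, and the force at the new position must have a positive inner product with the force at the start position.

```python
import math

def force_and_energy(pos, i, p, neighbors, c=1.0, rest=1.0):
    """Spring force on node i and its node energy if the node were placed at point p."""
    fx = fy = energy = 0.0
    for j in neighbors[i]:
        rx, ry = pos[j][0] - p[0], pos[j][1] - p[1]
        length = math.hypot(rx, ry)
        stretch = length - rest
        energy += 0.5 * c * stretch * stretch
        if length > 0.0:
            fx += c * stretch * rx / length
            fy += c * stretch * ry / length
    return (fx, fy), energy

def eom_refine(pos, neighbors, fixed, max_step=1.0, force_tol=0.05, max_moves=10000):
    """Move one node at a time along its force, accepting only energy-decreasing moves."""
    pos = [tuple(p) for p in pos]
    free = [i for i in range(len(pos)) if i not in fixed]
    for _ in range(max_moves):
        state = {i: force_and_energy(pos, i, pos[i], neighbors) for i in free}
        ranked = sorted(free, key=lambda i: math.hypot(*state[i][0]), reverse=True)
        if not ranked or math.hypot(*state[ranked[0]][0]) < force_tol:
            break  # largest nodal force is small enough
        moved = False
        for i in ranked:  # largest force first, then second largest, ...
            (fx, fy), e_start = state[i]
            norm = math.hypot(fx, fy)
            if norm == 0.0:
                continue
            ux, uy = fx / norm, fy / norm
            best_p, best_e = None, e_start
            for frac in (0.5, 1.0):  # middle position and end position
                q = (pos[i][0] + frac * max_step * ux, pos[i][1] + frac * max_step * uy)
                (gx, gy), e_q = force_and_energy(pos, i, q, neighbors)
                # Rule 1: energy must decrease. Rule 2: no reversal of the force.
                if e_q < best_e and fx * gx + fy * gy > 0.0:
                    best_p, best_e = q, e_q
            if best_p is not None:
                pos[i] = best_p
                moved = True
                break
        if not moved:
            break  # no node satisfies both conditions
    return pos
```

Because candidate positions are limited to half and full steps, the node stops within one step length of the exact equilibrium rather than on it, which reflects the trade-off the text accepts to simplify the computation.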



4 Experimental Results 

To evaluate the algorithms developed, experiments with various color images of different eyes and eye sequences were made. The preliminary experiments have shown that the gradient-based Hough transform, which is the first and therefore one of the most important steps, is very robust against noise as well as edges which are not produced by the contour of the iris. The deformable template using color information makes the eye contour detection more robust and accurate. The results on tracking and animating of the iris and the eye lids in one image sequence are shown in Figure 8. The animated eye sequences (for the Figure 8 images) after the first step (DM) and the second step (EOM) are shown in Figure 9, in which a plane mesh is used for testing our extended adaptive mesh algorithm. The overlapped results of coarse and fine adaptations are shown in Figure 10. Figure 11 shows the texture-mapped results using the first frame texture of the original sequence. The improvement of the adaptation accuracy from coarse adaptation (DM) to fine adaptation (EOM) can be clearly seen by comparing the top sequence with the bottom sequence in Figure 9 and Figure 10. Figure 11 (bottom) shows that the synthesized eye movements using the extended adaptive mesh are very close to the original sequence (Figure 8).

Fig. 8. Extracted eye templates in an actual sequence of eye images (frames 1, 24, 68, 169).




Fig. 9. Eye movement. [top]: coarse adaptation using DM method ($m_i$ = 1.2, $\gamma_i$ = 1.2, $k_1$ = 1.0, $k_2$ = 9.0; the thresholds of $v_i$ and $a_i$ are 60.0 at the time of stopping adaptation). [bottom]: fine adaptation using EOM method (the largest nodal force is 0.05 when adaptation is stopped).




Fig. 10. Overlapped eye mesh: coarse adaptation (top); fine adaptation (bottom).



To apply the extended adaptive mesh method to a real face model, we use a detailed wireframe model with 2954 vertices and 3118 patches, in which there are 120 vertices for each eye. Eye movements are synthesized as follows: First, eye movements are modeled as deformations of the individualized wireframe model (Figure 12) by using our existing face modeling tool [23]. Then, the adapted wireframe models in successive frames are texture-mapped by using the first frame of the sequence. Figure 13 shows a set of synthesized images showing eye animation with texture mapped on the facial model.

Fig. 11. Synthesized images using the first frame texture: texture-mapped results of coarse adaptation mesh (top) and the fine adaptation mesh (bottom).

Observe that the main advantage of our method is the improvement in quality of the animated images compared to those described in the existing literature ([16]-[20], [13]). One can clearly see the different expressions that a salesman may have depending on his interaction with customers.

Note that even though the DM method is faster, its mesh is not as accurate as that of our EOM improvement. The EOM improvement is essential for creating realistic animation. From the preliminary experimental results, we can see that the algorithms proposed here behave well for synthesizing eye movements in facial animation. The results are evaluated subjectively (see Figures 8, 11, 12, 13).

Applying the same extended adaptive mesh method, Figure 14 shows lip animation on the same set of faces. The theory and implementation behind tracking and animation of lip movements are similar to those of eye tracking and animation; however, they are not described in this paper because of limitations on the length of this document.

5 Conclusions and Future Work 

In this paper we discussed robust methods for detecting and tracking eye movements, a strategy for eye movement animation, and resulting applications in model-based low bit-rate coding. An energy-oriented method is described as an extension to dynamic meshes consisting of a network of springs. The proposed method refines the model adjustment process so as to improve the accuracy of model adaptation and tracking. It has also overcome the convergence problem that is commonly encountered in numerical solutions to dynamic-motion equations. Experimental results show that realistic animations with subtle expressions can be achieved by our system. We expect that the algorithms proposed in this paper will contribute towards future modifications to MPEG-4 (SNHC).

Fig. 12. Original facial image sequence (first frame), the individualized model and the adaptation result (salesman from SGI company).

Fig. 13. Synthesized eyes with texture-mapped model (from left to right, top to bottom: frames 1, 36, 81, 109, 125, 160).

Fig. 14. Synthesized eyes and lips with texture-mapped model (from left to right, top to bottom: frames 200, 253, 304, 375, 410).

In our current implementation, the node with the maximum force that satisfies the displacement criteria is searched for among all the nodes after each move of a node, which takes significant computation time. One alternative is to move a node until it cannot be moved any further while keeping the neighboring nodes fixed. This change is expected to reduce the search time and improve the efficiency of the EOM adaptation process. Another aspect that should be noted is that the accuracy of the coarse-to-fine adaptation comes at the cost of a higher bitrate. We will investigate developing an efficient parameter compression method for non-feature vertices to address this issue.

Abstract. The paper presents a new approach to recovering the 3D shape of rigid objects from a 2D image sequence. The method has two distinguishing features: it exploits the rigidity of the object over the sequence of images, rather than over a pair of images; and it estimates the 3D structure directly from the image intensity values, avoiding the common intermediate step of first estimating the motion induced on the image plane. The approach constructs the maximum likelihood (ML) estimate of all the shape and motion unknowns. We do not attempt the minimization of the ML energy function with respect to the entire set of unknown parameters. Rather, we start by computing the 3D motion parameters by using a robust factorization approach. Then, we refine the estimate of the object shape along the image sequence by minimizing the ML-based energy function with a continuation-type method. Experimental results illustrate the performance of the method.



1 Introduction 

The recovery of three-dimensional (3D) structure (3D shape and 3D motion) 
from a two-dimensional (2D) video sequence has been widely considered by the 
computer vision community. Methods that infer 3D shape from a single frame are 
based on cues such as shading and defocus. These methods fail to give reliable 
3D shape estimates for unconstrained real-world scenes. 

If no prior knowledge about the scene is available, the cue to estimating the 
3D structure is the 2D motion of the brightness pattern in the image plane. For 
this reason, the problem is generally referred to as structure from motion. The 
two major steps in structure from motion are usually the following: compute the 
2D motion in the image plane; and estimate the 3D shape and the 3D motion 
from the computed 2D motion. 

Structure from Motion Early approaches to structure from motion processed a single pair of consecutive frames and provided existence and uniqueness results for the problem of estimating 3D motion and absolute depth from the 2D motion in the camera plane between two frames, see for example [10]. Two-frame based algorithms are highly sensitive to image noise, and, when the object is far from the camera, i.e., at a large distance when compared to the object depth, they fail even at low levels of image noise. More recent research has been oriented towards the use of longer image sequences. For example, in [8], the authors use a Kalman filter to integrate over time a set of two-frame depth estimates, while reference [4] uses nonlinear optimization to solve for the rigid 3D motion and the set of 3D positions of feature points tracked along a set of frames. References [9] and [2] estimate the 3D shape and motion by factorizing a measurement matrix whose entries are the set of trajectories of the feature point projections.

The approaches of the references above rely on the matching of a set of features along the image sequence. This task can be very difficult when processing noisy videos. In general, only distinguished points, such as brightness corners, can be used as “trackable” feature points. As a consequence, those approaches do not provide dense depth estimates. In [1], we extended the factorization approach of [9] to recover 3D structure from a sequence of optical flow parameters. Instead of tracking pointwise features, we track regions where the optical flow is described by a single set of parameters. The approach of [1] is well suited to the analysis of scenes that can be well approximated with polyhedral surfaces. In this paper we seek dense depth estimates for general shaped surfaces.

To overcome the difficulties in estimating 3D structure through the 2D mo- 
tion induced onto the image plane, some researchers have used techniques that 
infer 3D structure directly from the image intensity values. For example [6] es- 
timates directly the 3D structure parameters by using the brightness change 
constraint between two consecutive frames. Reference [5] builds on this work by 
using a Kalman filter to update the estimates over time. 

Proposed approach To formulate the problem of inferring the 3D structure from a video sequence, we use the analogy between the visual perception mechanism and a classical communication system. This analogy has been used to deal with perception tasks involving a single image, such as texture segmentation and the recovery of shape from texture, see for example [7]. In a communication system, the transmitter receives a message $S$ to be sent to the receiver. The transmitter codes the message and sends the resulting signal through the channel to the receiver. The receiver gets the signal $I$, a noisy version of the transmitted signal. The receiver decodes $I$, obtaining the estimate $\hat{S}$ of the message $S$. In statistical communications theory, we describe the channel distortion statistically and design the receiver according to a statistically optimal criterion. For example, we can estimate $\hat{S}$ as the message $S$ that maximizes the probability of receiving the signal $I$, conditioned on the message $S$ being sent. This is the Maximum Likelihood (ML) estimate.

The communication system is a good metaphor for the problem of recovering 3D structure from video. The message source is the 3D environment. The transmitter is the geometric projection mechanism that transforms the real world $S$ into an ideal image. The channel is the camera that captures the image $I$, a noisy version of the ideal image. The receiver is the video analysis system. The task of this system is to recover the real world that originated the captured image sequence.

According to the analogy above, we recover the 3D structure from the video
sequence by computing the ML estimate of all the unknowns: the parameters de- 
scribing the 3D motion, the object shape, and the object texture. A distinguish- 
ing feature of our work is the formulation of the estimate from a set of images, 
rather than a single pair. This provides accurate estimates for the 3D structure, 
due to the 3D rigidity of the scene. The formulation of the ML estimate from a 
set of frames leads to the minimization of a complex energy function. To min- 
imize the ML energy function, we solve for the object texture in terms of the 
3D shape and the 3D motion parameters. By replacing the texture estimate, we 
are left with the minimization of the ML energy function with respect to the 
3D shape and 3D motion. We do not attempt the minimization of the ML energy 
function with respect to the entire set of unknown parameters by using generic 
optimization methods. Rather, we exploit the specific characteristics of the prob- 
lem to develop a computationally feasible approximation to the ML solution. We 
compute the 3D motion by using the factorization method detailed in [2]. In fact, 
experiments with real videos show that the 3D rigid motion can be computed 
with accuracy through the optical flow computed across a set of frames for a 
small number of distinguished points or regions. After estimating the 3D mo- 
tion, we are left with the minimization of the ML energy function with respect 
to the 3D shape. We propose a computationally simple continuation method 
to solve this non-linear minimization. Our algorithm starts by estimating coarse 
approximations to the 3D shape. Then, it refines the estimate as more images are 
being taken into account. The computational simplicity of our algorithm comes 
from the fact that each refinement stage, although non-linear, is solved by a 
simple Gauss-Newton method that requires no more than one or two iterations. 

Our approach provides an efficient way to cope with the ill-posedness of 
estimating the motion in the image plane. In fact, the local brightness change 
constraint leads to a single restriction, which is insufficient to determine the 
two components of the local image motion (the so called aperture problem). Our 
method of estimating directly the 3D shape overcomes the aperture problem 
because we are left with the local depth as a single unknown, after computing 
the 3D motion in a first step. 
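The aperture problem described above can be made concrete with a small numerical sketch. It assumes the usual linearized form of the brightness change constraint, Ix*u + Iy*v + It = 0, for image gradient (Ix, Iy) and temporal derivative It. The function `depth_from_constraint` further assumes, purely for illustration, that once the 3D motion is known the image displacement at a pixel is affine in a single depth value z, (u, v) = (a + b*z, c + d*z); the coefficients a, b, c, d and all function names are hypothetical.

```python
def brightness_residual(ix, iy, it, u, v):
    """Residual of the brightness change constraint Ix*u + Iy*v + It = 0."""
    return ix * u + iy * v + it

def normal_flow(ix, iy, it):
    """Flow component along the image gradient, the only part the constraint fixes."""
    g2 = ix * ix + iy * iy
    if g2 == 0.0:
        raise ValueError("zero gradient: the constraint carries no information")
    return (-it * ix / g2, -it * iy / g2)

def flows_on_constraint_line(ix, iy, it, offsets):
    """Flows that all satisfy the constraint: normal flow plus any tangential shift."""
    un, vn = normal_flow(ix, iy, it)
    return [(un - k * iy, vn + k * ix) for k in offsets]

def depth_from_constraint(ix, iy, it, a, b, c, d):
    """Solve the single constraint for z when (u, v) = (a + b*z, c + d*z)."""
    denom = ix * b + iy * d
    if denom == 0.0:
        raise ValueError("depth is unobservable along this gradient direction")
    return -(it + ix * a + iy * c) / denom
```

Every flow returned by `flows_on_constraint_line` has zero residual, so the constraint alone cannot pick one of them. With a single unknown z, the same constraint has a unique solution whenever the denominator is nonzero.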

In this paper we model the image formation process by assuming orthogonal projections. Orthogonal projections have been used as a good approximation to the perspective projection when the object is far from the camera [9,1,2]. With this type of scene, two-frame based methods fail to estimate the absolute depth. Although formulated assuming orthogonal projections, which leads to estimates of the relative depth, our method can be easily extended to cope with perspective projections, which then leads to estimates of the absolute depth.

Paper Organization Section 2 formulates the problem. Section 3 discusses the 
ML estimate. Section 4 summarizes the factorization method used to estimate 
the 3D motion. Section 5 describes the continuation method used to minimize 
the ML energy function. Experiments are in section 6. Section 7 concludes the 
paper. 



288 Pedro M.Q. Aguiar and Jose M.F. Moura 



2 Problem Formulation 

We consider a rigid object $O$ moving in front of a camera. We define the 3D motion of the object by specifying the position of the object coordinate system relative to the camera coordinate system. The position and orientation of $O$ at time instant $f$ is represented by $m_f = \{t_{uf}, t_{vf}, t_{wf}, \theta_f, \phi_f, \psi_f\}$, where $(t_{uf}, t_{vf}, t_{wf})$ are the coordinates of the origin of the object coordinate system with respect to the camera coordinate system (3D translation), and $(\theta_f, \phi_f, \psi_f)$ are the Euler angles that determine the orientation of the object coordinate system relative to the camera coordinate system (3D rotation).

Observation Model The frame $I_f$ captured at time $f$, $1 \le f \le F$, is modeled as a noisy observation of the projection of the object

$$I_f = \mathcal{P}(O, m_f) + W_f. \tag{1}$$

We assume that $\mathcal{P}$ is the orthogonal projection operator. For simplicity, the observation noise $W_f$ is zero mean, white, and Gaussian.

The object $O$ is described by its 3D shape $S$ and texture $T$. The texture $T$ represents the light received by the camera after reflecting on the object surface, i.e., the texture $T$ is the object brightness as perceived by the camera. The texture depends on the object surface photometric properties, as well as on the environment illumination conditions. We assume that the texture does not change with time.

The operator $\mathcal{P}$ returns the texture $T$ as a real valued function defined over the image plane. This function is a nonlinear mapping that depends on the object shape $S$ and the object position $m_f$. The intensity level of the projection of the object at pixel $u$ on the image plane is

$$\mathcal{P}(O, m_f)(u) = T(s_f(S, m_f; u)) \tag{2}$$

where $s_f(S, m_f; u)$ is the nonlinear mapping that lifts the point $u$ on the image $I_f$ to the corresponding point on the 3D object surface. This mapping $s_f(S, m_f; u)$ is determined by the object shape $S$ and the position $m_f$. To simplify the notation, we will usually write explicitly only the dependence on $f$, i.e., $s_f(u)$. Figure 1 illustrates the lifting mapping $s_f(u)$ and the direct mapping $u_f(s)$ for the orthogonal projection of a two-dimensional object. The inverse mapping $u_f(s)$ also depends on $S$ and $m_f$, but we will, again, usually show explicitly only the dependence on $f$. On the left of Figure 1, the point $s$ on the surface of the object projects onto $u_f(s)$ on the image plane. On the right, pixel $u$ on the image plane is lifted to $s_f(u)$ on the object surface. We assume that the object does not occlude itself, i.e., we have $u_f(s_f(u)) = u$ and $s_f(u_f(s)) = s$. The mapping $u_f(s)$, seen as a function of the frame index $f$, for a particular surface point $s$, is the trajectory of the projection of that point in the image plane, i.e., it is the motion induced in the image plane, usually referred to as optical flow.

The observation model (1) is rewritten by using (2) as

$$I_f(u) = T(s_f(u)) + W_f(u). \tag{3}$$

Fig. 1. Mappings $u_f(s)$ and $s_f(u)$.



We consider the estimation of the 3D shape $S$ and the 3D motion $\{m_f, 1 \le f \le F\}$ of the object $O$ given the video sequence $\{I_f, 1 \le f \le F\}$ of $F$ frames.
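The pair of mappings $u_f(s)$ and $s_f(u)$ can be illustrated with a toy version of the two-dimensional object of Figure 1. The sketch below rests on assumptions that are not in the paper: the object surface is a curve with depth `shape(s)` at surface coordinate `s`, the motion is a rotation by `theta` plus an image-plane translation `t`, the orthogonal projection simply drops the depth coordinate, and the lifting map is found by bisection, which requires the projection to be increasing in `s` (no self-occlusion). The names `project` and `lift` are hypothetical.

```python
import math

def project(s, shape, theta, t):
    """u_f(s): orthogonal projection of the surface point (s, shape(s)) after motion."""
    return math.cos(theta) * s - math.sin(theta) * shape(s) + t

def lift(u, shape, theta, t, s_lo, s_hi, iters=60):
    """s_f(u): surface coordinate whose projection is u, found by bisection.

    Assumes project() is increasing on [s_lo, s_hi], i.e. no self-occlusion.
    """
    lo, hi = s_lo, s_hi
    if not (project(lo, shape, theta, t) <= u <= project(hi, shape, theta, t)):
        raise ValueError("u lies outside the projected extent of the object")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if project(mid, shape, theta, t) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with `shape = lambda s: 0.2 * s * s`, `theta = 0.3`, and `t = 0.5` on the interval [-1, 1], lifting the projection of s = 0.37 returns s again, which is the no-occlusion identity $s_f(u_f(s)) = s$ stated above.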

Maximum Likelihood estimate formulation Given the observation model, the 3D shape and the 3D motion of the object $O$ are recovered from the video sequence $\{I_f, 1 \le f \le F\}$ by estimating all the unknowns: the 3D shape $S$; the texture $T$; and the set of 3D positions of the object $\{m_f, 1 \le f \le F\}$ with respect to the camera. We formulate the ML solution. When the noise sequence $\{W_f(u)\}$ is zero mean, spatially and temporally white, and Gaussian, the ML estimate minimizes the sum over all the frames of the integral over the image plane of the squared errors between the observations and the model¹,

$$C_{ML}(S, T, \{m_f\}) = \sum_{f=1}^{F} \int \left[ I_f(u) - T(s_f(u)) \right]^2 du, \tag{4}$$

$$\{\hat{S}, \hat{T}, \{\hat{m}_f\}\} = \arg\min_{S, T, \{m_f\}} C_{ML}(S, T, \{m_f\}). \tag{5}$$

In (4), we make explicit the dependence of the cost function $C_{ML}$ on the object texture $T$. Note that $C_{ML}$ depends on the object shape $S$ and the object positions $\{m_f\}$ through the mappings $\{s_f(u)\}$.
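A discrete analogue of the cost function (4) can be sketched as follows. For independent zero-mean Gaussian noise of variance sigma^2, the negative log-likelihood equals this sum of squared residuals divided by 2*sigma^2, up to an additive constant, which is why minimizing the cost gives the ML estimate. The one-dimensional setting, the purely translational lifting map `make_lift`, and the function names are illustrative assumptions, not part of the paper.

```python
import math

def make_lift(shift):
    """Hypothetical lifting map s_f(u) = u - shift for a translating 1D object."""
    return lambda u: u - shift

def render(texture, lift, width):
    """Noise-free frame I_f(u) = T(s_f(u)) sampled at integer pixels."""
    return [texture(lift(u)) for u in range(width)]

def ml_cost(frames, texture, lifts):
    """Sum over frames and pixels of the squared residuals I_f(u) - T(s_f(u))."""
    total = 0.0
    for frame, lift in zip(frames, lifts):
        for u, intensity in enumerate(frame):
            residual = intensity - texture(lift(u))
            total += residual * residual
    return total
```

As a usage example, frames rendered from `texture = lambda s: math.sin(0.7 * s)` with shifts 0, 2, and 3.5 give zero cost at the true shifts and a strictly positive cost when any shift is perturbed.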

3 Maximum Likelihood Estimation 

We address the minimization of $C_{ML}(S, T, \{m_f\})$ by first solving for the
texture estimate $\hat{T}$ in terms of the 3D object shape $S$ and the object
positions $\{m_f\}$.

Texture Estimate We rewrite the cost function $C_{ML}$ given by (4) by changing
the integration variable from the image plane coordinate $u$ to the object
surface coordinate $s$. We obtain

$$C_{ML}(S, T, \{m_f\}) = \sum_{f=1}^{F} \int \left[ I_f(u_f(s)) - T(s) \right]^2 J_f(s)\, ds, \tag{6}$$

where $u_f(s)$ is the mapping that projects the point $s$ on the object surface
onto the image plane at instant $f$, see figure 1. The function $J_f(s)$ is the
Jacobian of the mapping $u_f(s)$, $J_f(s) = |\nabla u_f(s)|$.

$^1$ We use a continuous spatial dependence for convenience. The variables $u$
and $s$ are continuous while $f$ is discrete.

Expression (6) shows that the cost function $C_{ML}$ is quadratic in each
intensity value $T(s)$ of the object texture. The ML estimate $\hat{T}(s)$ is

$$\hat{T}(s) = \frac{\sum_{f=1}^{F} I_f(u_f(s))\, J_f(s)}{\sum_{f=1}^{F} J_f(s)} \tag{7}$$

(see appendix A for the proof). Expression (7) states that the estimate of the
texture of the object at the surface point $s$ is a weighted average of the
measures of the intensity level corresponding to that surface point. A given
region around $s$ on the object surface projects at frame $f$ to a region
around $u_f(s)$. The size of this projected region changes with time because of
the object motion. The more parallel to the image plane the tangent to the
object surface at point $s$ is, the larger the projected region. Expression (7)
shows that the larger the Jacobian $J_f(s)$ is, i.e., the more the region
around $s$ is magnified at frame $f$, the larger the weight given to that frame
when estimating the texture $T(s)$.
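
As an illustration of expression (7), the following minimal sketch computes the Jacobian-weighted average on a handful of sampled surface points and checks that perturbing the result increases a discrete analogue of the cost (6). The function names, the random test data, and the replacement of the integral by a sum over samples are assumptions made for the example; they are not part of the original method description.

```python
import numpy as np

def ml_texture(intensities, jacobians):
    """Jacobian-weighted average over frames (axis 0), as in expression (7)."""
    intensities = np.asarray(intensities, dtype=float)
    jacobians = np.asarray(jacobians, dtype=float)
    return (intensities * jacobians).sum(axis=0) / jacobians.sum(axis=0)

def cost(texture, intensities, jacobians):
    """Discrete analogue of the cost (6): the integral becomes a sum over samples."""
    return float((((intensities - texture) ** 2) * jacobians).sum())

# Hypothetical data: F frames, P sampled surface points.
rng = np.random.default_rng(0)
F, P = 6, 4
I_obs = rng.normal(size=(F, P))          # stands in for I_f(u_f(s))
J = rng.uniform(0.5, 2.0, size=(F, P))   # positive Jacobians J_f(s)

T_hat = ml_texture(I_obs, J)
best = cost(T_hat, I_obs, J)
perturbed = cost(T_hat + 0.1 * rng.normal(size=P), I_obs, J)
```

Frames in which the neighbourhood of a point is magnified (large Jacobian) dominate the average, which is the behaviour described in the text.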
Structure from Motion as an approximation to ML By inserting the
texture estimate $\hat{T}$ given by (7) in (6), we can express the cost
function $C_{ML}$ in terms of the mappings $\{u_f(s)\}$. After manipulations
(see appendix B) we get

$$C_{ML}(S, \{m_f\}) = \sum_{f=2}^{F} \sum_{g=1}^{f-1} \int \left[ I_f(u_f(s)) - I_g(u_g(s)) \right]^2 \frac{J_f(s)\, J_g(s)}{\sum_{h=1}^{F} J_h(s)}\, ds. \tag{8}$$

The cost function $C_{ML}$ in (8) is a weighted sum of the squared differences
between all pairs of frames. At each surface point $s$, the frame pair
$\{I_f, I_g\}$ is weighted by $J_f(s) J_g(s) / \sum_{h=1}^{F} J_h(s)$. The
larger this weight is, i.e., the more a region around $s$ is magnified in
frames $I_f$ and $I_g$, the more the squared difference between $I_f$ and $I_g$
affects $C_{ML}$.

Expression (8) also makes clear why the problem we are addressing is referred
to as structure from motion: having eliminated the dependence on the texture,
we are left with a cost function that depends on the structure (3D shape $S$
and 3D motion $\{m_f\}$) only through the motion induced in the image plane,
i.e., through the mappings $\{u_f(s)\}$. Recall the comment in section 2 that
$u_f(S, m_f; s)$ depends on the shape $S$ and the motion $m_f$. The usual
approach to the minimization of the functional (8) is in two steps. The first
step estimates the motion in the image plane $u_f(s)$ by minimizing an
approximation of (8) (in general, only two frames are taken into account). The
second step estimates the shape $S$ and the motion $m_f$ from $\{u_f(s)\}$.
Since the motion in the image plane cannot be reliably computed in the entire
image, these methods cannot provide a reliable dense shape estimate.

Our approach combines the good performance of the factorization method 
in estimating the 3D motion with the robustness of minimizing the ML energy 
function with respect to the object shape. 
4 Rank 1 Factorization



This section summarizes the factorization method used to estimate the 3D
motion. For a detailed description, see [2]. The factorization approach is
robust because it models the rigidity of the moving object over time. This
method is also computationally simple because it uses a fast algorithm to
factorize a measurement matrix that is rank 1 in a noiseless situation.

A set of $N$ feature points is tracked along an image sequence of $F$ frames.
Under orthography, the projection of feature $n$ in frame $f$ is

$$\begin{bmatrix} u_{fn} \\ v_{fn} \end{bmatrix} = \begin{bmatrix} i_{xf} & i_{yf} & i_{zf} \\ j_{xf} & j_{yf} & j_{zf} \end{bmatrix} \begin{bmatrix} x_n \\ y_n \\ z_n \end{bmatrix} + \begin{bmatrix} t_{uf} \\ t_{vf} \end{bmatrix}, \tag{9}$$

where $i_{xf}$, $i_{yf}$, $i_{zf}$, $j_{xf}$, $j_{yf}$, and $j_{zf}$ are
entries of the well known 3D rotation matrix, uniquely determined by the Euler
angles $\theta_f$, $\phi_f$, and $\psi_f$, see [3], and $t_{uf}$ and $t_{vf}$
are the components of the object translation along the camera plane. We make
the object coordinate system and camera coordinate system coincide in the first
frame, so we have $u_{1n} = x_n$ and $v_{1n} = y_n$. Thus, the coordinates of
the feature points along the camera plane $\{x_n, y_n\}$ are given by their
projections in the first frame. The goal of the factorization method is to
solve the overconstrained equation system (9) with respect to the following set
of unknowns: the 3D positions of the object for $2 \le f \le F$, and the
relative depths $\{z_n, 1 \le n \le N\}$.

By choosing the origin of the object coordinate system to coincide with the 
centroid of the set of feature points, we get the estimate for the translation as 
the centroid of the feature point projections. Replacing the translation estimates 
in the system of equations (9), and defining 
we rewrite (9) in matrix format as



(11) 



$$R = M S^T. \tag{12}$$

Matrix $R$ is $2(F-1) \times N$ but it is rank deficient. In a noiseless
situation, $R$ is rank 3, reflecting the high redundancy in the data due to the
3D rigidity of the object.
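
The rank deficiency can be checked numerically. The sketch below is an illustration under stated assumptions: the Euler-angle convention in `rotation`, the random shape and motion, and the row ordering of `R` (all $u$ rows followed by all $v$ rows for frames 2 to $F$) are choices made for the example, not details taken from [2].

```python
import numpy as np

def rotation(a, b, c):
    """Rotation matrix Rz(a) @ Ry(b) @ Rx(c) (one possible Euler convention)."""
    ca, sa = np.cos(a), np.sin(a)
    cb, sb = np.cos(b), np.sin(b)
    cc, sc = np.cos(c), np.sin(c)
    Rz = np.array([[ca, -sa, 0.0], [sa, ca, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cc, -sc], [0.0, sc, cc]])
    return Rz @ Ry @ Rx

rng = np.random.default_rng(1)
F, N = 8, 30
S = rng.normal(size=(N, 3))   # hypothetical feature points [x, y, z]
S -= S.mean(axis=0)           # object origin at the centroid of the features

u_rows, v_rows = [], []
for f in range(1, F):         # frames 2..F
    Rf = rotation(*rng.uniform(-0.3, 0.3, size=3))
    t = rng.normal(size=2)
    proj = S @ Rf[:2].T + t    # orthographic projection, as in (9)
    proj -= proj.mean(axis=0)  # translation estimate = centroid of projections
    u_rows.append(proj[:, 0])
    v_rows.append(proj[:, 1])

R = np.vstack(u_rows + v_rows)   # 2(F-1) x N measurement matrix
rank = np.linalg.matrix_rank(R)
```

With noiseless data the numerical rank is 3 regardless of how many frames or features are stacked.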
The factorization approach finds a suboptimal solution to the bilinear LS
problem of equation (12) where the solution space is constrained by the
orthonormality of the rows of the matrix $M$ (11). This nonlinear minimization
is solved in two stages. The first stage, the decomposition stage, solves the
unconstrained bilinear problem $R = M S^T$. The second stage, the normalization
stage, computes a set of normalizing parameters by approximating the
constraints imposed by the structure of the matrix $M$.

Decomposition Stage Define $M = [M_0, m_3]$ and $S = [S_0, z]$. $M_0$ and $S_0$
contain the first two columns of $M$ and $S$, respectively, $m_3$ is the third
column of $M$, and $z$ is the third column of $S$. We decompose the relative
depth vector $z$ into the component that belongs to the space spanned by the
columns of $S_0$ and the component orthogonal to this space as
$z = S_0 b + a$, with $a^T S_0 = [0 \; 0]$. We rewrite $R$ in (12) as
$R = M_0 S_0^T + m_3 b^T S_0^T + m_3 a^T$.

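The decomposition stage is formulated as

$$\min_{M_0,\, m_3,\, b,\, a} \left\| R - M_0 S_0^T - m_3 b^T S_0^T - m_3 a^T \right\|_F, \tag{13}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. The solution for $M_0$ is given
by $\hat{M}_0 = R S_0 \left( S_0^T S_0 \right)^{-1} - m_3 b^T$. By replacing
$\hat{M}_0$ in (13), we get

$$\min_{m_3,\, a} \left\| \tilde{R} - m_3 a^T \right\|_F, \quad \text{where} \quad \tilde{R} = R \left[ I - S_0 \left( S_0^T S_0 \right)^{-1} S_0^T \right]. \tag{14}$$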



We see that the decomposition stage does not determine the vector $b$. This is
because the component of $z$ that lives in the space spanned by the columns of
$S_0$ does not affect the space spanned by the columns of the entire matrix $S$
and the decomposition stage restricts only this last space.

The solution for $m_3$ and $a$ is given by the rank 1 matrix that best
approximates $\tilde{R}$. In a noiseless situation, $\tilde{R}$ is rank 1, see
[2] for the details. By computing the largest singular value of $\tilde{R}$ and
the associated singular vectors, we get

$$\tilde{R} \simeq u \sigma v^T, \qquad \hat{m}_3 = \alpha u, \qquad \hat{a}^T = \frac{\sigma}{\alpha} v^T, \tag{15}$$

where $\alpha$ is a normalizing scalar different from 0. To compute $u$,
$\sigma$, and $v$ we could perform an SVD, but the rank deficiency of
$\tilde{R}$ enables the use of less expensive algorithms to compute $u$,
$\sigma$, and $v$, as detailed in [2].
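
The rank 1 property used in the decomposition stage can be illustrated with a few lines of linear algebra. This is a sketch under assumptions: the matrix `M` is a random stand-in (the property only needs $R = M S^T$, not the rotation structure of $M$), and a full SVD is used for simplicity instead of the less expensive algorithms of [2].

```python
import numpy as np

rng = np.random.default_rng(2)
F, N = 8, 30
M = rng.normal(size=(2 * (F - 1), 3))   # stand-in for the motion matrix
S0 = rng.normal(size=(N, 2))            # known x, y coordinates (first frame)
z = rng.normal(size=N)                  # unknown relative depths
S = np.column_stack([S0, z])
R = M @ S.T                             # noiseless measurement matrix, as in (12)

# Projector onto the column space of S0, then R_tilde as defined in (14).
P = S0 @ np.linalg.solve(S0.T @ S0, S0.T)
R_tilde = R @ (np.eye(N) - P)

# Best rank 1 approximation from the dominant singular triplet, as in (15).
U, sing, Vt = np.linalg.svd(R_tilde, full_matrices=False)
u, sigma, v = U[:, 0], sing[0], Vt[0]

# The right singular vector is parallel to a, the part of z orthogonal to S0.
a_true = z - P @ z
cosine = abs(v @ a_true) / np.linalg.norm(a_true)
```

The second singular value is zero up to rounding, and `v` matches the direction of `a_true`; the scale $\alpha$ and the vector $b$ remain undetermined at this stage, as stated in the text.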
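Normalization Stage In this stage, we compute $\alpha$ and $b$ by imposing the
constraints that come from the structure of $M$. We express $M$ in terms of
$\alpha$ and $b$ as

$$M = \begin{bmatrix} M_0 & m_3 \end{bmatrix} = N \begin{bmatrix} I_{2\times 2} & 0_{2\times 1} \\ -\alpha b^T & \alpha \end{bmatrix}, \quad \text{where} \quad N = \begin{bmatrix} R S_0 \left( S_0^T S_0 \right)^{-1} & u \end{bmatrix}. \tag{16}$$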



The constraints imposed by the structure of $M$ are the unit norm of each row
and the orthogonality between row $j$ and row $j + F - 1$, where $F$ is the
number of frames in the sequence. In terms of $N$, $\alpha$, and $b$, the
constraints are

$$n_i \begin{bmatrix} I_{2\times 2} & -\alpha b \\ -\alpha b^T & \alpha^2 (1 + b^T b) \end{bmatrix} n_i^T = 1, \qquad n_j \begin{bmatrix} I_{2\times 2} & -\alpha b \\ -\alpha b^T & \alpha^2 (1 + b^T b) \end{bmatrix} n_{j+F-1}^T = 0, \tag{17}$$

where $n_i$ denotes the row $i$ of the matrix $N$. We compute $\alpha$ and $b$
from the linear LS solution of the system above in a similar way to the one
described in [9].



5 Minimization Procedure 



After recovering the 3D motion $m_f$ as described in section 4, we insert the
3D motion estimates into the energy function (8) and minimize with respect to
the unknown shape $S$.

We first make explicit the relation between the image trajectories $u_f(s)$ and
the 3D shape $S$ and the 3D motion $m_f$. Choose the coordinate $s$ of the
generic point in the object surface to coincide with the coordinates
$[x, y]^T$ of the object coordinate system. Under orthography, a point with
coordinate $s$ in the object surface is projected on coordinate
$u = [x, y]^T = s$ in the first frame, so that $u_1(s) = s$ (remember that we
have chosen the object coordinate system so that it coincides with the camera
coordinate system in the first frame). At instant $f$, that point is projected
to

$$u_f(s) = \begin{bmatrix} i_{xf} & i_{yf} & i_{zf} \\ j_{xf} & j_{yf} & j_{zf} \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} + \begin{bmatrix} t_{uf} \\ t_{vf} \end{bmatrix} = N_f s + n_f z + t_f, \tag{18}$$

where $i_{xf}$, $i_{yf}$, $i_{zf}$, $j_{xf}$, $j_{yf}$, and $j_{zf}$ are
entries of the 3D rotation matrix [3], $N_f$ collects the first two columns of
the $2 \times 3$ matrix above, $n_f = [i_{zf}, j_{zf}]^T$ is its third column,
and $t_f = [t_{uf}, t_{vf}]^T$. The 3D shape is represented by the unknown
relative depth $z$.

Modified Image Sequence for Known Motion The 3D shape and the 3D motion are
observed in a coupled way through the 2D motion on the image plane, see
expression (18). When the 3D motion is known, the problem of inferring the 3D
shape from the image sequence is simplified. In fact, the local brightness
change constraint leads to a single restriction, which is insufficient to
determine the two components of the local image motion (this is the so-called
aperture problem). Our method of estimating directly the 3D shape overcomes the
aperture problem because we are left with the local depth as a single unknown,
after computing the 3D motion in the first step. To better illustrate why the
problem becomes much simpler when the 3D motion is known, we introduce a
modified image sequence $\{\tilde{I}_f, 1 \le f \le F\}$, obtained from the
original sequence $\{I_f, 1 \le f \le F\}$ and the 3D motion. We show that the
2D motion of the brightness pattern on image sequence $\{\tilde{I}_f\}$ depends
on the 3D shape in a very particular way. This motivates the algorithm we use
to minimize (8).

Consider the image $\tilde{I}_f$ related to $I_f$ by the following affine
mapping that depends only on the 3D position at instant $f$,

$$\tilde{I}_f(s) = I_f(N_f s + t_f). \tag{19}$$

From this definition it follows that a point $s$ that projects to $u_f(s)$ in
image $I_f$ is mapped to $\tilde{u}_f(s) = N_f^{-1} \left[ u_f(s) - t_f \right]$
in image $\tilde{I}_f$. Replacing $u_f(s)$ by expression (18), we obtain for the
image motion of the modified sequence $\{\tilde{I}_f\}$,

$$\tilde{u}_f(s) = s + N_f^{-1} n_f z. \tag{20}$$

Expression (20) shows that the trajectory of a point $s$ in image sequence
$\{\tilde{I}_f\}$ depends on the relative depth of that point in a very
particular way. In fact, the trajectory has the same shape for every point. The
shape of the trajectories is given by the evolution of $N_f^{-1} n_f$ across
the frame index $f$. Thus, the shape of the trajectories depends uniquely on
the rotational component of the 3D motion. The relative depth $z$ affects only
the magnitude of the trajectory. A point with relative depth $z = 0$ is
stationary in $\{\tilde{I}_f\}$, since we get $\tilde{u}_f(s) = s$ from (20)
for arbitrary 3D motion of the object.
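
The property stated by expression (20) can be verified numerically for a single frame. In the sketch below, the rotation parameterization (`rotation_xy`), the translation, and the two test points are hypothetical values chosen for illustration; the only ingredients taken from the text are expression (18) and the inverse affine mapping that defines the modified sequence.

```python
import numpy as np

def rotation_xy(ax, ay):
    """Rotation about the x axis followed by the y axis (illustrative choice)."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    return Rx @ Ry

def modified_position(s, z, Rf, tf):
    """Position of surface point (s, z) in the modified image of frame f."""
    Nf, nf = Rf[:2, :2], Rf[:2, 2]
    uf = Nf @ s + nf * z + tf             # expression (18)
    return np.linalg.solve(Nf, uf - tf)   # undo the affine part of the motion

Rf = rotation_xy(0.2, -0.3)
tf = np.array([1.5, -0.7])
direction = np.linalg.solve(Rf[:2, :2], Rf[:2, 2])   # N_f^{-1} n_f

s_a, s_b = np.array([0.4, 1.0]), np.array([-2.0, 0.3])
shift_a = modified_position(s_a, 0.5, Rf, tf) - s_a   # depth 0.5
shift_b = modified_position(s_b, 2.0, Rf, tf) - s_b   # depth 2.0
still = modified_position(s_a, 0.0, Rf, tf)           # depth 0
```

Both displacements are multiples of the same vector `direction`, scaled by the respective depths, and the point with zero depth does not move.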

Continuation Method By minimizing (8) with respect to the relative depth of
each point $s$, we are in fact estimating the magnitude of the trajectory of
the point to which the point $s$ maps in image sequence $\{\tilde{I}_f\}$. The
shape of the trajectory is known, since it depends only on the 3D motion. Our
algorithm is based on this characteristic of the ML energy function. We use a
continuation-type method to estimate the relative depth of each point. The
algorithm refines the estimate of the relative depth as more frames are taken
into account. When only a few frames are taken into account, the magnitude of
the trajectories on image sequence $\{\tilde{I}_f\}$ can be only roughly
estimated because the length of the trajectories is short and their shape may
be quite simple. When enough frames are considered, the trajectories on image
sequence $\{\tilde{I}_f\}$ are long enough, their magnitude is unambiguous, and
the relative depth estimates are accurate. Our algorithm does not compute
$\{\tilde{I}_f\}$; it rather uses the corresponding intensity values of
$\{I_f\}$.

The advantage of the continuation-type method is that it provides a
computationally simple way to estimate the relative depth because each stage of
the algorithm updates the estimate by using a Gauss-Newton method, i.e., by
solving a linear problem. We consider the relative depth $z$ to be constant in
a region $\mathcal{R}$. We estimate $z$ by minimizing the energy that results
from neglecting the weighting factor $J_f(s) J_g(s) / \sum_{h=1}^{F} J_h(s)$ in
the ML energy function (8). Thus, we get

$$\hat{z} = \arg\min_z E(z), \qquad E(z) = \sum_{f=2}^{F} \sum_{g=1}^{f-1} \int_{\mathcal{R}} e^2(z)\, ds, \tag{21}$$

where

$$e(z) = I_f(N_f s + n_f z + t_f) - I_g(N_g s + n_g z + t_g). \tag{22}$$

We compute $\hat{z}$ by refining a previous estimate $z_0$, as

$$\hat{z} = z_0 + \hat{\delta}_z, \qquad \hat{\delta}_z = \arg\min_{\delta_z} E(z_0 + \delta_z). \tag{23}$$
The Gauss-Newton method neglects the second and higher order terms of the
Taylor series expansion of $e(z_0 + \delta_z)$. By making this approximation,
we get

$$\hat{\delta}_z = - \frac{\sum_{f=2}^{F} \sum_{g=1}^{f-1} \int_{\mathcal{R}} e(z_0)\, e'(z_0)\, ds}{\sum_{f=2}^{F} \sum_{g=1}^{f-1} \int_{\mathcal{R}} \left[ e'(z_0) \right]^2 ds}, \tag{24}$$

where $e'$ is the derivative of $e$ with respect to $z$. By differentiating
(22), we get

$$e'(z) = I_{fx}(N_f s + n_f z + t_f)\, i_{zf} + I_{fy}(N_f s + n_f z + t_f)\, j_{zf} - I_{gx}(N_g s + n_g z + t_g)\, i_{zg} - I_{gy}(N_g s + n_g z + t_g)\, j_{zg}, \tag{25}$$

where $I_{fx}$ and $I_{fy}$ denote the components of the spatial gradient of
image $I_f$.

At the beginning, we start with the initial guess $z_0 = 0$ for any region
$\mathcal{R}$. We use square regions where $z$ is estimated as being constant.
The size of the regions determines the resolution of the relative depth
estimate. We use large regions when processing the first frames and decrease
the size of the regions as the continuation method takes more frames into
account.
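
To make the Gauss-Newton update concrete, the sketch below estimates a single relative depth in a simplified 1D world (2D object, 1D images), similar in spirit to the synthetic experiment of section 6. Everything specific here is an assumption made for illustration: the sinusoidal texture, the motion parameters, the noiseless analytic frames, and the use of all frames at once rather than the coarse-to-fine schedule of the continuation method.

```python
import numpy as np

def texture(x):
    return np.sin(4.0 * x)

def texture_grad(x):
    return 4.0 * np.cos(4.0 * x)

def make_frame(theta, t, z_true):
    """1D orthographic frame: a point (s, z) projects to u = N s + n z + t."""
    N, n = np.cos(theta), np.sin(theta)
    def image(u):        # I_f(u) for a region of constant depth z_true
        return texture((u - n * z_true - t) / N)
    def image_grad(u):   # spatial derivative of I_f
        return texture_grad((u - n * z_true - t) / N) / N
    return N, n, t, image, image_grad

def gauss_newton_depth(frames, s, z0=0.0, iters=20):
    """Refine a constant depth by repeated Gauss-Newton steps over frame pairs."""
    z = z0
    for _ in range(iters):
        num, den = 0.0, 0.0
        for f in range(1, len(frames)):
            Nf, nf, tf, If, dIf = frames[f]
            for g in range(f):
                Ng, ng, tg, Ig, dIg = frames[g]
                uf = Nf * s + nf * z + tf
                ug = Ng * s + ng * z + tg
                e = If(uf) - Ig(ug)                 # residual e(z)
                de = dIf(uf) * nf - dIg(ug) * ng    # derivative e'(z)
                num += np.sum(e * de)
                den += np.sum(de * de)
        z -= num / den                              # linearized LS update
    return z

z_true = 0.3
frames = [make_frame(0.05 * f, 0.1 * f, z_true) for f in range(5)]
s = np.linspace(0.0, 2.0, 200)    # samples of the region of constant depth
z_hat = gauss_newton_depth(frames, s)
```

Each iteration solves a scalar least squares problem, so the cost per stage stays small; with noisy images or short trajectories the estimate would be less sharply determined, which is the motivation for processing more frames progressively.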

6 Experiments 

We describe two experiments that illustrate our approach. The first experiment 
uses a synthetic sequence for which we compare the estimates obtained with the 
ground truth. The second experiment uses a real video sequence. 

Synthetic Sequence We consider that the world is 2D and that the images are 1D
orthogonal projections of the world. This scenario reflects all the basic
properties and difficulties of the structure from motion paradigm and
corresponds to the real 3D world if we consider only one epipolar plane and
assume that the motion occurs on that plane. In figure 2 we show a computer
generated sequence of 25 1D images. Time increases from top to bottom. The time
evolution of the rotational and translational components of the motion is shown
respectively in the left and middle plots of figure 3. The object shape is
shown on the right plot of figure 3. The object texture is an intensity
function defined over the object contour. We obtained the image sequence in
figure 2 by projecting the object texture on the image plane and by adding
noise.

In figure 4 we represent the modified image sequence, computed from the 
original sequence in figure 2, as described in section 5 for the 3D scenario, see 
expression (19). The motion of the brightness pattern in figure 4 is simpler than 
the motion in figure 2. In fact, the horizontal positions of the brightness patterns 
in figure 4 have a time evolution that is equal for the entire image (see, from 
figure 4 and the left plot of figure 3 that the shape of the trajectories of the 
brightness patterns is related to the rotational component of the motion). Only 
the amplitude of the time evolution of the horizontal positions of the brightness 
patterns in figure 4 is different from an object region to another object region. 
The amplitude for a given region is proportional to the relative depth of that 
region. Note that the brightness pattern is almost stationary for regions with 
Fig. 2. Sequence of 25 1D images: each horizontal slice is one image.

Fig. 3. True motion and true shape. Left: rotational motion; middle:
translational motion; right: object shape.



relative depth close to zero (see the regions around pixels 55 and 95 on the right 
plot of figure 3 and on figure 4). This agrees with the discussion in section 5. 

We estimated the relative depth of the object by using the continuation method
introduced in section 5. The evolution of the relative depth estimate is
represented in the plots of figure 5 for several time instants. The size of the
estimation region $\mathcal{R}$ was 10 pixels when processing the first 5
frames, 5 pixels when processing frames 6 to 10, and 3 pixels when processing
frames 11 to 25. The true depth shape is shown by the dashed line in the bottom
right plot of figure 5. The top left plot was obtained with the first three
frames and shows a very coarse estimate of the shape. The bottom right plot was
obtained after all 25 frames of the image sequence had been processed. In this
plot we made a linear interpolation between the central points of consecutive
estimation regions. This plot superposes the true and the estimated depths,
showing a very good agreement between them. The intermediate plots show
progressively better estimates of the depth shape.

Fig. 4. Modified image sequence for known motion.






Fig. 5. Continuation method: evolution of the shape estimate. Left to right,
top to bottom, after processing $F$ frames where $F$ is successively 3, 4, 5,
6, 8, 10, 15, 20, 25. The true shape is shown as the dashed line in the bottom
right plot.



Real Video We used a sequence of 10 frames from a real video sequence showing 
a toy clown. Figure 6 shows frames 1 and 5. Each frame has 384 x 288 pixels. 
Superimposed on frame 1, we marked with white squares 20 features used in 
the factorization method. The method used to select the features is reported 
elsewhere. We tracked the feature points by matching the intensity pattern of 
each feature along the sequence. Using the factorization approach summarized 
in section 4 we recovered the 3D motion from the feature trajectories. 

We estimated the relative depth of the 3D object by using the continuation
method described in section 5. The evolution of the estimate of the relative
depth is illustrated by figure 7. The grey level images in this figure code the
relative depth estimates. The brighter a pixel is, the closer to the camera it
is in the first frame. The size of the estimation region $\mathcal{R}$ was
30 x 30 pixels when processing the first 3 frames, 20 x 20 pixels when
processing frames 4 to 6, and 10 x 10 pixels when processing frames 7 to 10.
The left image was obtained with the first three frames and shows a very coarse
estimate of the shape. The right image was obtained after all 10 frames of the
image sequence had been processed.

Fig. 6. Clown sequence: frames 1 and 5.




7 Conclusion 

Final Remarks We presented a new approach to the recovery of 3D rigid struc- 
ture from a 2D video sequence. The problem is formulated as the ML estimation 
of all the unknowns directly from the intensity values of the set of images in 
the sequence. We estimate the 3D motion by using a factorization method. We 
develop a continuation-type algorithm to minimize the ML energy function with 
respect to the object shape. The experimental results obtained so far are promis- 
ing and illustrate the good performance of the algorithm. 

Future Extensions A number of possible extensions of this work are foreseen. 
First, our methodology can be modified to achieve the estimation of the abso- 
lute depth by using the perspective projection model. Other possible extensions 
include the investigation of different shape models. In this paper we use a dense 
depth map. Parametric models enable more compact and robust shape represen- 
tation. The formulation of the problem directly from the image intensity values 
provides a robust way of dealing with such issues as segmentation (for piece- 
wise models) and model complexity, by using classic tools such as the Bayesian 
inference or information-theoretic criteria. 
A Texture Estimation



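To prove that the ML estimate $\hat{T}(s)$ of the texture $T(s)$ is given by
expression (7), we show that it leads to the minimum of the cost function
$C_{ML}$, given by expression (6), over all texture functions $T(s)$. Consider
the candidate $T(s) = \hat{T}(s) + U(s)$. The functional $C_{ML}$ for texture
function $T(s)$ is

$$\begin{aligned} C_{ML}(T) &= \sum_{f=1}^{F} \int \left[ I_f(u_f(s)) - \hat{T}(s) - U(s) \right]^2 J_f(s)\, ds \\ &= \sum_{f=1}^{F} \int \left[ I_f(u_f(s)) - \hat{T}(s) \right]^2 J_f(s)\, ds + \sum_{f=1}^{F} \int U^2(s)\, J_f(s)\, ds \\ &\quad - 2 \sum_{f=1}^{F} \int \left[ I_f(u_f(s)) - \hat{T}(s) \right] U(s)\, J_f(s)\, ds. \end{aligned} \tag{26}$$

The first term of the expression above is $C_{ML}(\hat{T})$. The third term is
0, as follows immediately by replacing $\hat{T}(s)$ by expression (7). We have

$$C_{ML}(T) = C_{ML}(\hat{T}) + \sum_{f=1}^{F} \int U^2(s)\, J_f(s)\, ds \ge C_{ML}(\hat{T}), \tag{27}$$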

which concludes the proof. The inequality comes from the fact that we can
always choose the texture coordinates $s$ in such a way that the mappings
$u_f(s)$ are such that the determinants $J_f(s) = |\nabla u_f(s)|$ are
positive. For example, make the texture coordinate $s$ equal to the image plane
coordinate $u$ in the first frame $I_1$. The mapping $u_1(s)$ is the identity
mapping $u_1(s) = s$ and we have a positive Jacobian $J_1(s) = 1$. Now, draw an
oriented closed contour on the surface $S$, in the neighborhood of $s$, and
containing $s$ in its interior. This contour, which we call $C_s$, is projected
in image $I_1$ to an oriented closed planar contour $C_{u_1}$. It is
geometrically evident that the same contour $C_s$ projects in image $I_f$ to a
contour $C_{u_f}$ that has, in general, a different shape but the same
orientation as the contour $C_{u_1}$ (remember that we are assuming the object
does not occlude itself). For this reason, the Jacobian $J_f(s)$ of the
function that maps from $s$ to $u_f(s)$ for $2 \le f \le F$ has the same sign
as the Jacobian $J_1(s)$ of the function that maps from $s$ to $u_1(s)$, so we
get $J_f(s) > 0$ for $1 \le f \le F$.



B $C_{ML}$ in Terms of $\{u_f(s)\}$



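We show that the ML-based cost function is expressed in terms of the motion in
the image plane as in expression (8). Replace the texture estimate
$\hat{T}(s)$, given by (7), into the ML-based cost function, given by (6).
After simple algebraic manipulations, we get

$$C_{ML} = \int \sum_{f=1}^{F} \left[ I_f(u_f(s)) - \frac{\sum_{g=1}^{F} I_g(u_g(s))\, J_g(s)}{\sum_{h=1}^{F} J_h(s)} \right]^2 J_f(s)\, ds. \tag{28}$$

Expressing the square above in terms of a sum of products and carrying out the
products, after algebraic manipulations, we get

$$C_{ML} = \int \frac{\sum_{f=1}^{F} \sum_{g=1}^{F} \sum_{h=1}^{F} \left[ I_f^2(u_f(s)) - I_f(u_f(s))\, I_g(u_g(s)) \right] J_f(s)\, J_g(s)\, J_h(s)}{\left[ \sum_{h=1}^{F} J_h(s) \right]^2}\, ds. \tag{29}$$

Now we divide by $\sum_{h=1}^{F} J_h(s)$ both the numerator and the denominator
of the integrand function. By using the equality

$$\sum_{f=1}^{F} \sum_{g=1}^{F} \left[ I_f^2(u_f(s)) - I_f(u_f(s))\, I_g(u_g(s)) \right] J_f(s)\, J_g(s) = \sum_{f=2}^{F} \sum_{g=1}^{f-1} \left[ I_f(u_f(s)) - I_g(u_g(s)) \right]^2 J_f(s)\, J_g(s), \tag{30}$$

we get

$$C_{ML} = \int \frac{\sum_{f=2}^{F} \sum_{g=1}^{f-1} \left[ I_f(u_f(s)) - I_g(u_g(s)) \right]^2 J_f(s)\, J_g(s)}{\sum_{h=1}^{F} J_h(s)}\, ds, \tag{31}$$

and conclude the derivation. Note that by interchanging the integral and the
sum in (31), we get the ML-based cost function $C_{ML}$ as in expression (8).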
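
The equivalence between the texture-based cost (6), evaluated at the estimate (7), and the pairwise form (8) can be checked numerically at a single surface point. The intensities and Jacobians below are random placeholder values used only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
F = 5
I = rng.normal(size=F)              # stands in for I_f(u_f(s)) at one point s
J = rng.uniform(0.5, 2.0, size=F)   # positive Jacobians J_f(s)

# Integrand of (6) evaluated at the texture estimate (7).
T_hat = (I * J).sum() / J.sum()
texture_form = (((I - T_hat) ** 2) * J).sum()

# Integrand of (8): weighted squared differences over all frame pairs.
pairwise_form = sum(
    (I[f] - I[g]) ** 2 * J[f] * J[g]
    for f in range(1, F)
    for g in range(f)
) / J.sum()
```

The two values agree up to rounding, which mirrors the algebra leading from (28) to (31).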
Abstract. Due to the inherently limited resolution of Positron Emission
Tomography (PET) scanners, quantitative measurements taken from PET images suffer
from the partial volume averaging of activity across regions of interest. A cor- 
rection for this effect in PET activity distributions is therefore essential to dis- 
tinguish differences due to changes in tracer concentration from those due to 
changes in the volumes of the active brain tissue. Various consequent image post- 
processing techniques have been developed to address this problem (see, for ex- 
ample, [1,2,3,4,5,6]). These operate using associated high resolution anatomical 
images such as Magnetic Resonance Imaging (MRI), but as well as being highly 
susceptible to errors in the requisite registration and segmentation procedures, 
the methods are reliant on unrealistic simplifying assumptions regarding activity 
distributions in the brain. This work instead couples the correction of PET data to 
the reconstruction process itself, presenting a two-step scheme using associated 
MRI data to achieve this. The first step estimates the prior activity distributions 
from a segmented and intensity transformed MRI image. This is then used in the 
second step to constrain the Bayesian PET reconstruction with varying degrees 
of stringency. The prior, or initial correction process, is applied in the form of 
an energy term, adapted in accordance to an entropy measure taken on the MRI 
segmentation; i.e., where there is anatomical variation, we assume there also to 
be activity variation. 



1 Introduction 

Positron Emission Tomography (PET) is a non-invasive functional imaging technique, 
unique in its ability to image pathological condition. A radioactive tracer is introduced 
into the blood stream of the patient to be imaged. Its distribution is then indicative 
of metabolic activity or blood flow. The isotope itself decays quite quickly, emitting 
positrons which, within a very short distance, collide with electrons. The annihilation 
that takes place causes two gamma emissions to occur in almost exactly opposite di- 
rections. These emissions are recorded in the various detector crystals that surround the 
patient, and the projected counts are stored in sinograms. The aim of PET
reconstruction is to estimate the spatial distribution of the isotope
concentration on the basis of the sinogram data.

Because of the inherently limited resolution of PET scanners and the subsequent 
poor ability to resolve detail in reconstructed images, quantitative measurements taken 
from PET images suffer from the partial volume averaging of activity across regions of 
interest. A correction for this partial volume effect (PVE) in PET activity distributions 
is therefore essential to distinguish differences due to changes in tracer concentration 
from those due to changes in the volumes of the active brain tissue. Otherwise, as the
relative percentage of cerebrospinal fluid (CSF), grey matter (GM) and white matter
(WM) varies - particularly as a result of aging or disease - any given change in the
apparent tracer concentration may instead reflect a change in morphology, and without
the benefits of an associated anatomical image, we are at a loss in deciphering this.

Various image post-reconstruction techniques have been developed to address such 
issues (see, for example, [1,2,3,4,5,6]). Operating in accordance to constraints derived 
from associated high resolution anatomical images, these attempt to restore or redis- 
tribute the PET data of the reconstructed images. But, as well as being highly suscepti- 
ble to errors in the requisite registration and segmentation procedures, such localisation 
methods are dependent on unrealistic simplifying assumptions made on the activity dis- 
tributions in the brain. 

Alternative methods begin with the sinogram data, and must therefore address the
problem of tomographic image reconstruction. This task is inherently ill-posed, and
as such it is best tackled using statistical methods. Yet artifacts remain, products mainly
of the ill-conditioned nature of the system of linear equations to be solved. Methods of
regularisation, typically found in the form of either a penalising term [7], or of Bayesian
priors [8,9], have been used to confront this. When high resolution anatomical
information is available, however, stricter, more meaningful constraints can be imposed on the
reconstruction solution. That is, the aforementioned image-space correction methods
would now be better formulated as part of the reconstruction step. This has been clearly
demonstrated in the literature with the incorporation of an accurate projection model
[10], and also with the more accurate model of the activity source [11,12,13].

A new approach is demonstrated in this paper, where the contribution is to firstly 
derive a more appropriate prior model of the activity distributions, and to secondly 
present an algorithm capable of its exploitation. The prior model is designed to avoid 
over-simplifying assumptions regarding the activity distributions, and a Bayesian re- 
construction procedure is implemented to accommodate it. 



2 Problem Definition 

The digitisation of an image grid within the PET scanner's field of view allows us to
assume the following:

- $\lambda$ is a 2-D image of $J$ pixels, where $\lambda_j$ denotes the expected number
of annihilation events occurring at the $j$th pixel. It is the PET activity distribution
to be reconstructed.

- $y$ denotes the $I$-D measurement vector, where $y_i$ denotes the coincidences counted
by the $i$th of the $I$ detector pairs. This is the sinogram data recorded by the scanner.

- $a_{ij}$ is the probability that an emission originating at the $j$th pixel is detected in
the $i$th detector pair, where $\sum_j a_{ij} = 1, \forall i$. This forms the stochastic model of
the acquisition process, more commonly referred to as the system matrix. For the
purposes of the algorithm developed in this paper, this model must include the Point
Spread Function (PSF) of the scanning device; it should describe, stochastically,
how the underlying tracer distribution came to be observed as the PET signal.

- $\bar{y}_i$ denotes the mean, or expected number of coincidences detected by the $i$th
detector pair, such that $\bar{y}_i = \sum_{j=0}^{J-1} a_{ij} \lambda_j$.

That is, we express our PET reconstruction problem on the basis of the following
matrix form,

$$\bar{y} = A \lambda, \tag{1}$$

where $A \in \mathbb{R}^{I \times J}$ is the above defined system matrix that contains the
weight factors between each of the image pixels and each of the projections recorded in
the sinogram. Given we are dealing with radioactive decay, the detections are modelled
using a Poisson distribution. This distribution is defined about its means as:

$$y_i \sim \mathrm{Poisson}(\bar{y}_i) = \frac{\bar{y}_i^{\,y_i}}{y_i!} \exp(-\bar{y}_i), \quad i = 0, \ldots, I-1. \tag{2}$$



The Expectation-Maximization Algorithm for PET Reconstruction Following the
seminal papers of [14,15], the Maximum Likelihood (ML) estimate is derived using the
Expectation-Maximization (EM) algorithm [16]. In this case, the complete data set,
denoted $\psi_{ij}$, is defined to be the number of emissions occurring in pixel $j$ and
recorded in the $i$th detector pair. The measured (incomplete) data can now be
re-expressed in terms of the complete data as $y_i = \sum_{j=0}^{J-1} \psi_{ij}$, for all
$i = 0, \ldots, I-1$. The alternative summation yields the total number of emissions that
have occurred in pixel $j$ and have been detected in any of the crystals,
$\sum_{i=0}^{I-1} \psi_{ij} \sim \lambda_j$, for all $j = 0, \ldots, J-1$. On this
basis, the problem of PET reconstruction now becomes one of estimating the means of
the $\psi_{ij}$.

The likelihood function of the complete data set is a measure of the chance that we
would have obtained the data that we actually observed ($y$), if the value of the tracer
distribution ($\lambda$) were known:

$$L(\lambda) = \prod_{i=0}^{I-1} \prod_{j=0}^{J-1} \frac{\bar{\psi}_{ij}^{\,\psi_{ij}}}{\psi_{ij}!} \exp(-\bar{\psi}_{ij}), \tag{3}$$

where $\bar{\psi}_{ij}$ denotes the mean of $\psi_{ij}$. From the above one is able to
derive the following iterative scheme [15]:

$$\lambda_j^{k+1} = \frac{\lambda_j^{k}}{\sum_{i=0}^{I-1} a_{ij}} \sum_{i=0}^{I-1} \frac{a_{ij}\, y_i}{\sum_{j'=0}^{J-1} a_{ij'}\, \lambda_{j'}^{k}}, \tag{4}$$

where $k$ denotes the iterate number.
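
A minimal sketch of this multiplicative update on a toy problem is given below. The system matrix, the activity values, and the problem size are random placeholders chosen for illustration; a real system matrix would come from the scanner geometry and its point spread function, and the update is written in the standard form of the EM iteration for emission tomography as we understand it from [15].

```python
import numpy as np

def mlem_step(lam, A, y):
    """One EM update of the activity estimate for Poisson sinogram data."""
    y_bar = A @ lam                # expected counts per detector pair
    sensitivity = A.sum(axis=0)    # sum over detector pairs of a_ij
    return lam / sensitivity * (A.T @ (y / y_bar))

rng = np.random.default_rng(3)
I_det, J_pix = 40, 10
A = rng.uniform(0.1, 1.0, size=(I_det, J_pix))   # hypothetical system matrix
lam_true = rng.uniform(1.0, 5.0, size=J_pix)     # hypothetical activity
y = rng.poisson(A @ lam_true).astype(float)      # simulated sinogram counts

lam = np.ones(J_pix)                             # flat, strictly positive start
for _ in range(50):
    lam = mlem_step(lam, A, y)
```

Two properties of the update are visible in the sketch: a positive starting image stays non-negative, and after every iteration the total of the expected counts `A @ lam` equals the total of the measured counts `y`.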
The Ill-Conditioned Nature of the ML Estimates Unfortunately, as the ML estimates
become more refined, they also become excessively noisy. That is, the wrong form
of variance is increased and a pseudo over-fitting occurs to progressively worsen the
final image. The point at which this deterioration begins, the start of over-convergence,
depends upon a number of factors and is all but intractable.

Part solutions for this problem may be found by implementing new stopping rules 
[17,18], post filtering [19], or by using some form of regularisation to impose appro- 
priate constraints on the reconstruction solution (see, for example, [7,20]). Alterna- 
tively, the solution may be constrained to distribution models [8], which has proved 
popular among the image reconstruction community [9,11,21,22,23,24]. Applicable 
though these random field models are, PET reconstruction can benefit further from pri- 
ors derived from Magnetic Resonance Imaging (MRI) data. This additional information 
typically provides boundaries within which the random field models may be applied 
[12,13,25,26,27]. The resulting increase in accuracy is restricted only according to the 
crudity of the assumptions necessary to realise the implementation. 



3 The Bayesian Approach 

The parameters to be estimated amount in many senses to a hypothesis of the data. 
In this sense, a likelihood function only tells us how well our hypothesis explains the 
observed data. Bayes’s theorem tells us instead the probability that the hypothesis is 
true given the data [28]. The hypothesis, or prior, to be derived is a first estimation of 
the activity distribution of the subject, and the Bayesian paradigm allows us to maintain 
our faith in this estimate, letting only the observed data persuade us otherwise. 

Gindi [26], and Leahy [27] achieve this with Maximum A Posteriori (MAP) esti- 
mates of simulated PET data incorporating line sites (originally proposed by [8]) from 
associated MRI, which are appended to the set of parameters to be estimated in each 
maximisation step of the procedure. The obvious extension to this approach is to weight
these sites such that they agree with edges present in the PET data [29]. Instead of
line sites, Bowsher et al. [30] build a segmentation model into their reconstruction pro- 
cess whose repeated re-estimation is able to gauge the progress of the reconstruction. 
Lipinski et al. [12] use MRI priors to delineate different Markov and Gaussian energy 
fields acting to regularise the solution. Sastry and Carson [13] adopt a similar approach, 
this time coupling both the Markov and Gaussian fields in an effort to reconcile two 
desirable properties of the reconstruction solution. These are, in the words of Leahy [31],
that images are locally smooth, except where they are not!

The models, and the assumptions behind the two aforementioned random field pri- 
ors used in [12,13] are the following: 

- Global Homogeneity: within each tissue compartment, the activity concentrations
are Gaussian distributed with a unique mean. This is expressed using a Gaussian
Random Field.

- Local Homogeneity: within a homogeneous activity distribution, neighbouring pixels
tend to have similar values. This they model with a Markov Random Field
(MRF) defined over a first order neighbourhood [32].
Sastry and Carson [13] apply these to tissue-type activities individually, where the
tissue-type n is derived from associated MRI data. The first of the above priors does not
enforce any local neighbourhood properties on the reconstruction, but instead assumes
that an activity level corresponding to a tissue of a known class will not be significantly
different from that of the mean activity level for that class. This is termed the Gaussian
prior. In the case of the second prior, the piecewise smoothness assumption common to
many Bayesian methods is used, and the PET image is thought to contain activities that
vary little across neighbouring pixels. The authors of [13], however, are original in
applying this prior in concert with the first (thus forming the so-called
Smoothness-Gaussian prior), in an attempt to constrain local variation alongside the
restriction that activity levels should remain within sensible, global bounds.

4 Developing New Priors Based on a High Resolution PET Image 

The development of a good prior can be all important to the success of the Bayesian 
scheme. As such, a possible weakness of the above approach is the use of mean activ- 
ity estimates, which, in meeting the necessary regularisation requirements, are likely 
to encourage homogeneous distributions. The estimation of these means is derived 
from knowledge of the tissue type, where, for each compartment, CSF, WM, GM, 
and other, this is given by the following activity ratios: 0:0.005:1:4, respectively (see, 
[6,33,34,35,36,37] for similar such estimations). Under the assumption that the
activity in each tissue-type is uniform, the mean activity for each tissue-type n is
derivable from a least-squares solution to the tomographic system, using a
pseudo-inverse of the system matrix (see the original definition in equation 1).

The following proposes a new approach to defining initial activity estimates, whose 
emphasis is to allow inhomogeneous distributions to occur in each of the tissue classes. 
It is a model-based approach whose solution is an image-space PET correction. The 
development of a reconstruction algorithm designed to iteratively update the solution is 
left as the subject of the next section. 

Some Assumptions on the Activity Distribution If at all avoidable, presumed homogeneous
distributions should not be a part of any attempt to model (and thus, correct) the
PET signal. From [38], for example, we learn that it is normally GM which shows the
greatest amount of variance; CSF is likely to show some, although this is negligible;
and WM should show some aspects of variability in cases of diseases, such as multiple
sclerosis. Additionally, this work cites lesions, focal activations and field
heterogeneities as typical factors that would violate the homogeneity assumption.
Justification is also to be found in the discussion of Ma [34], where the opinion that
sharp activity changes occur at structural boundaries is refuted (i.e., those occurring
under the homogeneity assumption are unlikely), and where it is argued to be more
probable that the distributions can be characterised by some gradient in tracer level
occurring within and possibly across structures.

For practical purposes, the increased knowledge won from simple homogeneity assumptions
is applied in different forms by [3,5,6,33,37] for PET correction and redistribution,
and [34] for PET simulation (a process applied to [36,39,40]). Improving the
assumptions, however, requires a better knowledge of the emission source, where it
is difficult to avoid generalisations. The method developed in the following
heeds the advice of [38] and attempts to do just that.

Intensity Normalisation - A High Resolution PET Prior We derive in the following a 
model for activity distributions based on an agreement with the original ideas of Friston 
et al. [38]. In fitting the underlying flow-distribution of the PET image, one is able to 
derive a correction for the observable PET data constrained in accordance to the MRI 
delineations of true regions and knowledge regarding the scanner response. The solution 
is a corrected “PET” image at the resolution of the MRI data, although in the context of 
its later usage, we will refer to this distribution as a high resolution prior. 

Supposing that the MRI and PET images are in co-registration^1, we say that the
original PET image ($\lambda_j^{o}$, reconstructed without the use of any model) can be
described by an intensity transformation of the associated segmented MRI image ($m_j$)
and a convolution that relates the differences in resolution. That is,

$$\lambda_j^{o} \approx h_j * \gamma_j\{m_j\}, \qquad (5)$$

where * denotes convolution, $h_j$ is the convolution kernel that reflects the resolution
mismatch in the PET and MRI data, and $\gamma_j$ is the intensity transformation that we
wish to derive. The MRI data are segmented into regions of GM, WM, and CSF, from which
activity levels are assigned according to estimated ratios [6,13,33,34,35,36,37] (the
implicit fourth compartment is background, for which zero activity is the expected level).
The expansion of equation 5 in [38] is in the form of a Taylor series about a single
GM segmentation function, and operates therefore under the premise that the PET
signal in WM and CSF regions is sufficiently negligible to be absent from the model.
The coefficients of the expansion are themselves expanded in terms of B basis functions
such that they are non-stationary and smoothly varying about a given locality [38].
In the following, only this latter aspect of the model is retained, as the segmentation
is made explicit using a fuzzy algorithm [43] to derive probability maps of affinities to
GM, WM and CSF; denoted below as $m_j^{g}$, $m_j^{w}$ and $m_j^{c}$, respectively. The
model is now defined to contain three intensity transformation functions, $\gamma_j^{g}$,
$\gamma_j^{w}$ and $\gamma_j^{c}$, operating on the GM, WM and CSF segmentations:

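$$\lambda_j^{o} \approx h_j * \left[ g\,\gamma_j^{g}\{m_j^{g}\} + w\,\gamma_j^{w}\{m_j^{w}\} + c\,\gamma_j^{c}\{m_j^{c}\} \right], \qquad (6)$$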

where again each intensity transformation function is made up of basis functions
and non-stationary coefficients (equation 7), and g, w and c are normalised ratio
contributions. The basis functions, applied to each intensity transformation
$\gamma_j^{\eta}$ for $\eta \in \{g, w, c\}$, are derived from,

[Equation (7): a summation over the basis functions, indexed b = 0, ..., B - 1; the expression itself is illegible in the source.]

The coefficients of this expansion are the unknowns to be estimated.

^1 It is important to note here that, as the images are in co-registration, their pixel
dimensions and the physical sizes of each pixel are the same. Nonetheless, one can still
talk about one data set being of a higher resolution than another in that it exhibits a
higher effective resolution. In this case, although the PET data may be at the same
physical resolution as the MRI data, it does not utilise this sampling density as well as
the MRI, and is considered, therefore, to be of a lower effective resolution [41,42].

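Deriving the High Resolution PET Prior Once a solution for each of the B vectors
is found, we are able to derive a high resolution PET prior ($\lambda_j^{p}$) by simply
removing the convolution term:

$$\lambda_j^{p} = g\,\gamma_j^{g}\{m_j^{g}\} + w\,\gamma_j^{w}\{m_j^{w}\} + c\,\gamma_j^{c}\{m_j^{c}\}. \qquad (8)$$
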

This yields a PET image at the resolution of the MRI data, valid in accordance to the 
original assumptions made of the model. That is, it emulates a “restored” PET image, 
where the restoration is predominantly a deconvolution, constrained by the delineation 
of the GM, WM and CSF distributions in the MRI data.
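The idea of fitting the model through the blur and then dropping the convolution can be sketched in one dimension. This is a deliberately simplified illustration and not the method of the paper: each intensity transformation is assumed to be a single constant gain (in effect one constant basis function), only GM and WM maps are used, and the names `convolve_same`, `fit_gains`, the kernel and the toy maps are all hypothetical.

```python
def convolve_same(x, h):
    """Same-size 1-D convolution with zero padding (h has odd length)."""
    r = len(h) // 2
    return [
        sum(h[k] * x[i + r - k] for k in range(len(h)) if 0 <= i + r - k < len(x))
        for i in range(len(x))
    ]


def fit_gains(pet, h, m_g, m_w):
    """Least-squares gains (G, W) such that pet ~ h * (G*m_g + W*m_w)."""
    u = convolve_same(m_g, h)  # blurred GM map
    v = convolve_same(m_w, h)  # blurred WM map
    uu = sum(a * a for a in u)
    vv = sum(b * b for b in v)
    uv = sum(a * b for a, b in zip(u, v))
    up = sum(a * p for a, p in zip(u, pet))
    vp = sum(b * p for b, p in zip(v, pet))
    det = uu * vv - uv * uv  # 2x2 normal equations solved by Cramer's rule
    return (up * vv - vp * uv) / det, (vp * uu - up * uv) / det


# Hypothetical toy data: a blur kernel and two non-overlapping tissue maps.
h = [0.25, 0.5, 0.25]
m_g = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
m_w = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0]
true_g, true_w = 4.0, 1.0
pet = convolve_same([true_g * g + true_w * w for g, w in zip(m_g, m_w)], h)

gain_g, gain_w = fit_gains(pet, h, m_g, m_w)
# Removing the convolution term gives the sharp, MRI-resolution prior.
prior = [gain_g * g + gain_w * w for g, w in zip(m_g, m_w)]
```

In the full model the constant gains would be replaced by smoothly varying sums of basis functions, so that inhomogeneity within each tissue class can be captured.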

The Resulting Intensity Transformation Results from the use of this transformation are 
shown in figure 1. As the figure shows, with only a small number of basis functions 
the transformation is quite convincing in its ability to map the MRI data to the PET 
data. For purposes of validation, figure 2 shows the transformation applied to a Monte
Carlo based simulation of the acquisition process, where in this instance some notion 
of ground-truth is available, yet noise and other distorting effects are realistic. 

In the context of the image-space correction techniques that use MRI data to compensate
for the PVE in PET, the result itself constitutes a solution to the problem of PET
redistribution. It is of high resolution, and of low noise. Uncertainties in the
segmentations, however, are propagated to uncertainties in the solution. This effect is
evident in figure 1 where regions of the resulting prior seem smeared; the algorithm is
only able to average a fit. As such, the solution cannot guarantee uniqueness, and the
validity of the correction method must therefore be gauged on some localised basis.
Further justification for this requirement is apparent in regions where the MRI data is
homogeneous and the PVE does not occur. In this instance, the prior should have no
influence on the reconstruction solution.

Such heuristics can, to a good extent, be used to drive the reconstruction process, 
and this PET correction result may yet be iteratively improved upon. This requires a 
formulation of the problem within the Bayesian framework, which is the basis of the 
remainder of this work. 

[Figure 1: the intensity transformation with 8 basis functions, and the intensity transformation with just 4 basis functions, on 64x64 pixel images. Panel titles: The New Prior Image; Original Filtered Back-Projected; MRI Intensity; MRI Segmentations.]

Fig. 1. From left to right, the figures shown above are: the high resolution PET image
(the new prior, $\lambda_j^{p}$ of equation 8); the PET image reconstructed from the
sinogram data using Filtered Back-Projection (FBP); the MRI intensity transformed image
(the PET model with the convolution term present: $h_j * \lambda_j^{p}$); the MRI GM
segmentation ($m_j^{g}$); and the MRI WM segmentation ($m_j^{w}$).

[Figure 2]

Fig. 2. From left to right, the figures shown above are: the true distribution of a PET
image (Monte Carlo simulation based); its associated segmentation; the Filtered
Back-Projection (FBP) reconstruction based on the simulated sinogram (including noise,
scatter and random coincidences); and the high resolution PET image, $\lambda_j^{p}$,
recovered from the segmentation and the FBP image alone.

5 Applying the High Resolution PET Data as a Prior

The Adopted Form of the Prior To build the above derived prior distribution into
a reconstruction algorithm, the approach given here follows the general methods of
[11,12,13]. The prior is wrapped in an energy function, designed to be minimal when
the estimate for the reconstruction, $\lambda_j$, matches that of the activity estimates
derived from the intensity transformation prior, denoted $\lambda_j^{p}$. This gives us a
prior probability model, applied in the following form,

$$P(\lambda) = \frac{1}{Z} \exp(-U(\lambda)), \qquad (9)$$

where Z is a normalisation term (the partition function), and $U(\lambda)$ is the energy
function chosen to impose localised constraints on the reconstructed activities. The a
posteriori probability for tissue-type activities given the sinogram data, for which a
maximum is sought, is $P(\lambda|y) \propto p(y|\lambda) P(\lambda)$, where $p(y|\lambda)$
is our likelihood function, and $P(\lambda)$ is the prior probability model. The applied
prior is thus defined as $P(\lambda) \propto \exp(-U(\lambda))$, with the energy function
to be minimised given as:

$$U(\lambda) = \sum_{j} \frac{(\lambda_j - \lambda_j^{p})^2}{2\sigma_j^2}. \qquad (10)$$

This takes the form of a Gaussian distribution, where in this implementation the
estimates of tissue activities, $\lambda_j^{p}$, are those of the estimated prior
distribution of equation 8. This involves the individual estimation of each pixel
intensity, with an additional notion of local smoothness implicitly included as a result
of restrictions on local variation set by the basis functions in equation 7. The
consequence is that the constraints combine in a manner akin to the aforementioned
Smoothness-Gaussian prior of [13]. In estimating activity at each pixel (through the
$\lambda_j^{p}$), however, it is the granularity of the basis functions that determines
how well the assumption on piecewise smoothness is avoided.

The $\sigma_j$ control the Gaussians' standard deviations, which in turn reflect the
stringency of the reconstruction's coupling to the prior. This can be altered on a
localised basis, allowing the algorithm to show great versatility. As $\sigma_j \to 0$,
the reconstruction tends toward the prior (or the image-space correction method), and as
$\sigma_j \to \infty$, it tends toward the EM solution. These hyperparameters take on the
physical interpretation of [13], in that they relate to the allowable variation in
activity levels. Here, an additional interpretation can be afforded, as these values are
estimated in accordance with the likely influence of the PVE; i.e., related to the degree
of correction required. This, in turn, may be estimated from the MRI data, as discussed
below.



Deriving the Iterative Algorithm The likelihood function of the means of the complete
data set, $L(\bar{\psi})$, was given in equation 3. Relating the complete data set to the
activity distribution, $\bar{\psi}_{ij} = a_{ij}\lambda_j$, and defining the conditional
expectation value for the $\psi_{ij}$ on the basis of a standard probability result,
yields the expectation step of the EM reconstruction algorithm. This, denoted
$N_{ij}^{k}$, is given as:

$$N_{ij}^{k} = \mathcal{E}[\psi_{ij} \mid y, \lambda^{k}] = \frac{a_{ij} \lambda_j^{k}\, y_i}{\sum_{j'=0}^{J-1} a_{ij'} \lambda_{j'}^{k}}, \qquad (11)$$

where k denotes the iteration number, and $\mathcal{E}$ the expectation. From equation 3,
the log likelihood expressed in terms of $N_{ij}^{k}$ is,

$$l(\lambda^{k+1}) = \ln L(\lambda^{k+1}) = \sum_{i=0}^{I-1} \sum_{j=0}^{J-1} \left( N_{ij}^{k} \ln a_{ij}\lambda_j^{k+1} - a_{ij}\lambda_j^{k+1} - \ln N_{ij}^{k}! \right). \qquad (12)$$
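The expectation step and the complete-data log likelihood can be checked numerically on a small example. The sketch below is illustrative only: the function names `e_step` and `log_likelihood`, the toy system matrix and the data are hypothetical, and the factorial of a non-integer expectation is evaluated through the log-gamma function, which is an assumption of this sketch.

```python
import math


def e_step(lam, a, y):
    """Expected complete data N[i][j] given the current estimate lam."""
    n = []
    for i, row in enumerate(a):
        proj = sum(aij * lj for aij, lj in zip(row, lam))  # expected counts in i
        n.append([aij * lj * y[i] / proj for aij, lj in zip(row, lam)])
    return n


def log_likelihood(lam, a, n):
    """Complete-data log likelihood; ln N! is computed as lgamma(N + 1)."""
    total = 0.0
    for row_a, row_n in zip(a, n):
        for aij, lj, nij in zip(row_a, lam, row_n):
            total += nij * math.log(aij * lj) - aij * lj - math.lgamma(nij + 1.0)
    return total


# Hypothetical toy problem: 3 detector pairs, 2 pixels.
a = [[0.8, 0.1], [0.1, 0.7], [0.1, 0.2]]
y = [3.0, 2.0, 1.0]
lam = [1.0, 1.0]

n = e_step(lam, a, y)
# Plain ML maximisation step (no prior): ratio of expected counts to sensitivity.
lam_new = [
    sum(n[i][j] for i in range(3)) / sum(a[i][j] for i in range(3))
    for j in range(2)
]
```

Each row of `n` sums to the corresponding measurement, and the maximisation step cannot decrease the complete-data log likelihood; the Bayesian scheme that follows changes only this maximisation step.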



In the Bayesian approach, instead of maximising the likelihood function, it is
necessary to maximise the a posteriori probability. This must include our energy function
from the form of equation 9, yielding,

$$\arg\max_{\lambda} \{ l(\lambda) - U(\lambda) \}. \qquad (13)$$
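From [11,13], we optimise by setting the derivative of $\{ l(\lambda) - U(\lambda) \}$
(with respect to $\lambda_j$) to zero:

$$\lambda_j^2 + \left( \sigma_j^2 \sum_{i=0}^{I-1} a_{ij} - \lambda_j^{p} \right) \lambda_j - \sigma_j^2 \sum_{i=0}^{I-1} N_{ij}^{k} = 0. \qquad (14)$$

Taking only the positive root to the solution of this quadratic equation yields the
following iterative reconstruction scheme due to [11,13]:

$$\lambda_j^{k+1} = \frac{1}{2} \left[ -\left( \sigma_j^2 \sum_{i=0}^{I-1} a_{ij} - \lambda_j^{p} \right) + \sqrt{ \left( \sigma_j^2 \sum_{i=0}^{I-1} a_{ij} - \lambda_j^{p} \right)^2 + 4 \sigma_j^2 \sum_{i=0}^{I-1} N_{ij}^{k} } \right], \qquad (15)$$

where $N_{ij}^{k}$ is defined according to equation 11.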
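A per-pixel sketch of this update makes its two limiting behaviours easy to verify. It assumes a Gaussian energy of the form (lambda - prior)^2 / (2 sigma^2) for each pixel; the function name `map_update` and its argument names are hypothetical, and the rearranged branch for large sigma is an implementation choice of this sketch to avoid numerical cancellation.

```python
import math


def map_update(n_sum, sens, prior, sigma):
    """Positive root of lam^2 + b*lam - sigma^2 * n_sum = 0 for one pixel.

    n_sum : sum over detector pairs of the expected complete data for the pixel
    sens  : sum over detector pairs of the system matrix entries for the pixel
    prior : prior activity estimate for the pixel
    sigma : standard deviation coupling the pixel to its prior
    """
    s2 = sigma * sigma
    b = s2 * sens - prior
    root = math.sqrt(b * b + 4.0 * s2 * n_sum)
    if b > 0.0:
        # Algebraically identical form that avoids subtracting two large numbers.
        return 2.0 * s2 * n_sum / (b + root)
    return 0.5 * (root - b)


lam_tight = map_update(3.0, 1.0, 5.0, 1e-6)  # sigma -> 0: follows the prior
lam_loose = map_update(3.0, 1.0, 5.0, 1e6)   # sigma -> infinity: EM update
lam_mid = map_update(3.0, 1.0, 5.0, 2.0)     # intermediate coupling
```

For the intermediate case the result lies between the EM value (3) and the prior (5), and it satisfies the stationarity condition n_sum / lam - sens - (lam - prior) / sigma^2 = 0.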

Selecting the Gaussian Function's Standard Deviations ($\sigma_j$) The purpose of the
algorithm in section 4 was to derive a prior estimation of activity levels for each PET
pixel at the resolution of the MRI image. As such, the energy term of equation 10 should
also be chosen for each pixel individually, and the standard deviation of each Gaussian
field thus reflects this.



Use of an Entropy Measure Regions likely to suffer the PVE are those where the variation
in structure is the greatest; outside of such regions there is no need to redistribute
the data. As such, $\sigma_j$ must be steered according to a local measure taken on the
MRI data. This is achieved using an entropy measure about small windowed regions. For
example, in areas of high entropy, where the data exhibits less structural variation,
the corresponding $\sigma_j$ terms should become wider.

The application of such an entropy measure requires that its actual value and range
be made explicit. It is then used to prompt variation in the $\sigma_j$ terms, which are
otherwise set in accordance to the paper of [13]:

$$\sigma_j = \frac{K \cdot C \cdot e_j}{2\sqrt{2 \ln 2} \cdot 100}, \qquad (16)$$

where K defines the full-width half-maximum of the Gaussian, interpreted as being
K% of some scaling constant C. In the results given in the remainder of this section,
the entropy image, $e_j$, is simply scaled to within 1 ± 0.6 (0.4 where the prior shows
greatest structure, and 1.6 corresponds to completely homogeneous regions), and C is
set to the average intensity value in our prior image ($\lambda_j^{p}$). Admittedly,
these are relatively arbitrary choices, but ones that succeeded in consistently reducing
the least-squares error in tests where the ground-truth data is known.
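One way such a steering of the standard deviations might be realised is sketched below. The exact entropy measure is not specified in the text, so this sketch assumes a normalised entropy of the window intensities treated as a distribution (which is maximal for a homogeneous window, matching the convention above), a linear rescaling to the range 0.4 to 1.6, and the usual Gaussian relation FWHM = 2 sqrt(2 ln 2) sigma. All function names and the toy windows are hypothetical.

```python
import math


def window_entropy(values):
    """Normalised entropy of a window's intensities treated as a distribution.

    Returns 1.0 for a perfectly homogeneous window and less when structured.
    """
    total = sum(values)
    if total <= 0 or len(values) < 2:
        return 1.0
    h = -sum((v / total) * math.log(v / total) for v in values if v > 0)
    return h / math.log(len(values))


def rescale(entropies, lo=0.4, hi=1.6):
    """Linearly map the entropy image onto [lo, hi]."""
    e_min, e_max = min(entropies), max(entropies)
    if e_max == e_min:
        return [0.5 * (lo + hi) for _ in entropies]
    return [lo + (hi - lo) * (e - e_min) / (e_max - e_min) for e in entropies]


def sigma_from_fwhm(k_percent, c, e):
    """Standard deviation of a Gaussian whose FWHM is K% of C, scaled by e."""
    fwhm = k_percent / 100.0 * c * e
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


# Hypothetical windows: homogeneous, maximally structured, and in between.
windows = [[2.0, 2.0, 2.0, 2.0], [8.0, 0.0, 0.0, 0.0], [4.0, 2.0, 1.0, 1.0]]
ent = [window_entropy(w) for w in windows]
scaled = rescale(ent)
sigmas = [sigma_from_fwhm(30.0, 10.0, e) for e in scaled]
```

Under these assumptions the homogeneous window receives the widest Gaussian (weakest coupling to the prior) and the most structured window the narrowest.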



6 Results 

Experiments have been performed on artificial, Monte Carlo, and actual PET-MRI
studies. In the first instance, the manufacturing of test data involved the blurring of a
phantom image, the addition of Poisson noise, and finally its forward-projection to
produce the sinogram data^2. Insight into the algorithm's workings was then gleaned,
serving primarily to put the variation according to the entropy measure to within
sensible bounds.



[Figure 3. Legible panel titles: White Matter Segmentation (top row); True Distribution, 1st Iteration, 2nd Iteration, 3rd Iteration, 4th Iteration (bottom row).]

Fig. 3. In the above, the top 5 images show the intensity transformation scheme, as
discussed in section 4. The first image is the high resolution PET image
($\lambda_j^{p}$), which in this case is derived from a PET reconstruction (the image
second from the left, $\lambda_j^{o}$) based on the Ordered-Subsets EM (OSEM) algorithm
[45]. This is a fast EM reconstruction method operating at speeds that are now clinically
acceptable. The central image is the model of the OSEM distribution, and the remaining
images show the WM and GM segmentations, respectively. The bottom row displays the
following: the true, underlying PET distribution (note that this is simulated to exhibit
the resolution of the MRI data); and the 1st, 2nd, 3rd and 4th iterations of the Bayesian
reconstruction for K = 30. Only four iterations are shown, as convergence is seen to
occur rapidly. The reconstruction "snaps" very quickly onto a solution, which is
characteristic behaviour for methods involving such constrained priors. One must,
therefore, be careful that the solution is the appropriate one.

The results of the simulated experiments shown in figure 3 show how the reconstruction
solution is able to very quickly take the form of the true distribution. From
this it might be thought that the algorithm is simply converging toward the prior.
Evidence against this is that the least-squares error between the true distribution and
the reconstruction solution improves upon that between the true distribution and the
prior after only a few iterations. That is, iterative adjustments to the prior-based
solution are necessary.

Although the results of figure 4 from real MRI and PET studies have no associated
ground-truth, visually they are able to demonstrate two important features: structure
is retained, delineating known tissue regions; and the general intensity values are, on
a regional basis, the same as those of the Ordered Subset EM (OSEM) reconstructed
image. The top row shows images reconstructed with the Bayesian scheme presented
in this paper. In this case, the entropy measure is applied, and K = 35 (equation 16).
The next row shows increased iterations (from the same starting estimate) of the OSEM
algorithm. As 8 subsets were used, each iteration is approximately equal to 8 iterations
of the EM algorithm [45]. Being a statistical technique, the OSEM algorithm uses the
same system matrix as the Bayesian scheme, and should therefore also account for
resolution loss due to the scanner's PSF. The next row shows the Bayesian scheme
without the use of the entropy measure. The resulting reconstructions are able to very
quickly demonstrate better contrast and structure in the images, especially so in regions
where the PVE is likely to be of greatest influence (in regions of low entropy).

^2 This data set was taken from the BrainWeb Database [44].



[Figure 4. Column labels: 1st Iteration; Continued Iterations (from the same starting estimate); 5th Iteration. Legible row label: Bayesian Scheme with Entropy, K=35.]

Fig. 4. The above figure shows three different reconstruction schemes each starting from
the same initial estimate, an OSEM reconstruction using 8 subsets after 4
iterations. It was also on the basis of this image that the high resolution prior
($\lambda_j^{p}$) was defined. The top row (a) shows increasing iterations (1, 2 and 5)
of our Bayesian scheme that includes the entropy measure to increase recovery from the
PVE. The next row (b) shows the OSEM reconstruction scheme as it iterates further. The
final row (c) shows the Bayesian scheme without the entropy measure. For both of the
Bayesian methods, K of equation 16 is 35; a slightly more conservative value than that
for the experiment using simulated data.
7 Discussion

In Respect of the Prior It was never envisaged that the intensity transformed MRI 
segmented image could be capable of replacing the reconstructed PET data. It is an ar- 
tificially constructed model, and hence the distribution’s use as a prior. The transform 
simply assigns PET values to structural objects, and the result can only constitute “a 
functional categorisation of anatomical structures” of the sort implied by the segmen- 
tation [38]. In a way, it does emulate a fully “restored” PET image, yet the restoration 
is only as good as the validity of the assumptions concerning the activity distributions. 
We feel, however, justified in using this method to estimate a prior distribution, as the 
assumptions are indeed valid, with homogeneity notable only because of its absence. 



Choice of the Basis Functions As figure 1 shows, the more basis functions that can be 
applied, the better the fit that can be attained. However, by imposing a finite upper limit 
to their size, we introduce the notion of a neighbourhood to the reconstruction process. 
As such, it is necessary to use basis functions to capture PET activity at its finest pos- 
sible localisation of activity (see [46]), but not beyond. Any finer than necessary would 
simply mean modelling the noise. What granularity captures constituent components
of the activations can be approximately estimated from [47], for example, or indeed
from the approach adopted in the Statistical Parametric Mapping methods of the Functional
Imaging Laboratory in London [48]. Here, considerable smoothing is done prior to any
significance testing in activation studies, and the short conclusion is that the use of 8
basis functions for 64-by-64 pixel images, 16 for 128-by-128 images, and so on, would
seem about right.



Regarding the Reconstruction Algorithm The algorithm presented operates with a number
of free parameters whose better selection is required to achieve improvements in
performance. The appropriate selection of the $\sigma_j$ in equation 16 is most critical.
This basically steers the algorithm toward a normal iterative reconstruction solution at
one extreme, or to an effective PET correction method at the other. For the selection of
this and other such hyperparameters, there are basically two options [31]: the data-driven
empirical approach; or an estimation theoretic approach.

Unfortunately, alterations to this term and the subsequent algorithmic flexibility
resulting from the local variation of the $\sigma_j$ of the Gaussian fields may not be
appreciated by the Bayesian purists. Nonetheless, it is important in emission tomography
to constrain the solution only as and when it is correct to do so. Even if a contribution cannot
be afforded globally, local improvements can have a positive effect. Among Llacer’s 
results from studies of such distributions in Bayesian reconstructions [49], was the in- 
dication that prior information applied in some areas of an imaging field has a tendency 
to improve the results of a reconstruction elsewhere. This may at first seem a surpris- 
ing and perhaps rather dubious result, but when one considers how highly correlated an 
emission image is, and how its piecing together is a problem of global optimisation with 
constraints such as positivity and energy conservation, then increasing the certainty of 
the solution in one area is indeed likely to aid the solution in another. 
7.1 Conclusions and Future Work

The least-squares fit of the intensity transformation gives, depending on the individual’s
agreement with the initial assumptions, a sensible estimate of the tracer distribution at 
the resolution of the MRI data. The result is a high resolution, low noise prior, whose 
application within the framework of a Bayesian reconstruction scheme allows for an 
appropriately corrected, well regularised, reconstructed PET image. 

Issues of registration and segmentation errors, however, highlight the well known 
shortcomings in all such cross-modality reconstruction methods. This work’s adoption 
of a fuzzy segmentation method coupled to a summation of basis functions in order 
to estimate the underlying activity distribution does, to a large extent, exhibit some 
robustness with respect to the latter of these two issues. This was seen in figure 1, but
the ensuing necessary averaging of the intensity assignments does, of course, limit the
effectiveness of the approach. With respect to errors in the registration of the different
image sets, the algorithm, like its counterparts, reveals its frailty. Errors need be
only very slight to render the associated MRI data all but useless.

With the fundamental aim of this work being to improve the resolution of the PET 
data, addressing, for example, the PVE, the approach that has been taken seeks to bridge 
the PET redistribution methods applied as post-processing techniques, and the
model-based reconstruction methods applied at the sinogram level. Coupled in a
complementary manner, this paper has sought to demonstrate how effective this can be.
Abstract. This paper considers the use of the EM-algorithm, combined
with mean field theory, for parameter estimation in Markov random field 
models from unlabelled data. Special attention is given to the theoretical 
justification for this procedure, based on recent results from the machine 
learning literature. With these results established, an example is given 
of the application of this technique for analysis of single trial functional 
magnetic resonance (fMR) imaging data of the human brain. The re- 
sulting model segments fMR images into regions with different ‘brain 
response’ characteristics. 



1 Introduction 

The purpose of this paper is two-fold: first, it reviews the theoretical underpin- 
nings for the use of the EM-algorithm in conjunction with mean field theory for 
parameter estimation in Markov random field (MRF) models from unlabelled 
data. Second, it demonstrates the usefulness of this approach with an MRF model
for single trial functional magnetic resonance imaging (fMRI) data.

Techniques for learning from unlabelled data are important in the analysis of 
fMRI data of the human brain, since the data generating mechanism is still far 
from completely understood. Obvious ethical reasons put limitations on what 
sort of alternative methods we can use to verify results obtained from fMRI. 
Other functional brain imaging techniques, which may appear as the obvious 
answer, suffer exactly the same problem. At the same time, the quantity and 
quality of fMRI data make automated analysis procedures necessary. 

2 Markov Random Fields, Mean Field Theory and the 
EM-algorithm 

In this section, we briefly review Markov random field models, the mean field 
theory and its connection to the EM-algorithm. Mean field theory is a long-established
tool in statistical mechanics and statistical physics. It has also been
extensively used in the fields of computer vision and, more recently, machine 
learning. 
2.1 Markov Random Field Models

An MRF [24] is a set of N random variables indexed over the vertices, or sites,
in an ordered lattice. The typical example is a 2-D image, where the random 
variables are the labels (e.g. colour) associated with the pixels. The MRF vari- 
ables are not independent, but are mutually coupled; the key property of MRFs 
is that the distribution of the random variable associated with a site, n, given 
the values associated with the sites in a (typically small) neighbourhood of n, is 
independent of the rest of the sites in the MRF. This can be formalised as 

$$p(x_n \mid x_m, n \neq m) = p(x_n \mid x_m \in \mathcal{N}_n),$$

where $x_n$ denotes the random variable of site n and $\mathcal{N}_n$ is the set of random
variables associated with the sites that are in the neighbourhood of site n.

The distribution over the MRF variables, which is assumed to be strictly
positive, can be written as a Gibbs distribution,

$$p(x) = \frac{1}{Z} \exp(-E(x)), \qquad (1)$$

where x is a vector formed by concatenating the vectors $x_n$ (n = 1, ..., N), E is an
energy function and Z is a normalisation constant,

$$Z = \sum_{x} \exp(-E(x)), \qquad (2)$$

where the sum runs over all possible values of x. Note that computing Z, which
is known as the partition function, is generally tractable only for very small
MRFs, since the number of terms in the sum in (2) increases exponentially with
the size of the MRF. This is due to the mutual coupling between the MRF
variables. The same problem emerges if we want to compute the marginal posterior
distribution over any of the individual MRF variables - e.g. for the purpose of
parameter fitting - since this requires summing over all remaining variables.
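The exponential growth of the sum can be seen by computing the partition function by brute force for a very small lattice. The energy used below, which rewards neighbouring sites with equal labels by an amount `beta`, is an assumption chosen purely for illustration, as are the function names and the 2-by-2 lattice.

```python
import itertools
import math


def energy(x, beta, edges):
    """Illustrative energy: -beta for every neighbouring pair with equal labels."""
    return -beta * sum(1 for n, m in edges if x[n] == x[m])


def partition_function(n_sites, n_labels, beta, edges):
    """Sum of exp(-E(x)) over all n_labels ** n_sites configurations."""
    return sum(
        math.exp(-energy(x, beta, edges))
        for x in itertools.product(range(n_labels), repeat=n_sites)
    )


# A 2x2 lattice with a first-order neighbourhood: 4 sites and 4 edges.
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
beta = 0.5
z = partition_function(4, 2, beta, edges)

# Gibbs probabilities of all 2 ** 4 = 16 configurations.
probs = [
    math.exp(-energy(x, beta, edges)) / z
    for x in itertools.product(range(2), repeat=4)
]
```

Even a modest 64-by-64 binary image would require 2 ** 4096 terms in the same sum, which is why approximations are needed.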

The energy function E defines the properties of the MRF model and can 
generally be written 

$$E(x, y, \theta, \beta) = E^{\mathrm{ext}}(x, y, \theta) + E^{\mathrm{int}}(x, \beta).$$

$E^{\mathrm{ext}}$ denotes the energy (or potential) arising from external influence; in the
context of probabilistic image modelling, this typically comes from observed data
y via a model determined by parameters $\theta$, and corresponds to a log-likelihood
term. $E^{\mathrm{int}}$ denotes the internal energy which, as suggested by the notation, only
depends on the MRF variables x and parameter $\beta$, and corresponds to a prior
distribution over x.

2.2 The Mean Field Theory 

To address the computational difficulties associated with MRF models, a number 
of approximate methods have been proposed [24]. One popular such method is
the so-called mean field approximation, from statistical mechanics [7]. This
consists of replacing $p(x \mid y, \theta, \beta)$ with an approximating, computationally
tractable, parameterised distribution, $q(x \mid m)$. As has been shown by several
authors [3,7,31,38], the mean field approximation can be given a formal
justification as providing a computationally tractable bound on quantities of
interest (e.g. the partition function). Moreover, we can optimise the variational
parameter m by minimising the Kullback-Leibler (KL) divergence between
$q(x \mid m)$ and $p(x \mid \theta, \beta, y)$,

$$\mathrm{KL}(q \,\|\, p) = \sum_{x} q(x \mid m) \ln \frac{q(x \mid m)}{p(x \mid \theta, \beta, y)}, \tag{3}$$

which is non-negative for all probability distributions q and p and equals zero
only when they are identical. The literature on statistical mechanics [7], MRFs
[3,38] and probabilistic graphical models [19,31] provides examples of applying
this methodology to different models. Section 3.3 of this paper provides an
example for a multi-level logistic MRF model.

2.3 A ‘Variational’ View of the EM-algorithm 

Traditionally, the EM-algorithm [8] is viewed as a two-step algorithm for max- 
imum-likelihood parameter estimation from incomplete data. The first step (the 
E-step) consists of computing the expectation over the random variables which 
are missing in the data (e.g. the labels in unlabelled data), x, given the observed
variables, y, and the current set of parameters, $\theta$. The second step (the M-step)
maximises the resulting expected complete log-likelihood function with respect
to its adjustable parameters, $\theta$. However, it can also be seen as an algorithm for
minimising the variational free energy from statistical mechanics and statistical 
physics [35,26], linking it to the mean field theory. From this point of view, 
it is natural to also consider situations where the exact distribution over the 
missing variables cannot be computed, but has to be replaced by an approximate 
distribution. This yields an algorithm which maximises a lower bound of the log- 
likelihood. The difference between this bound and the true log-likelihood is the 
KL-divergence between the exact and approximating distributions. 

Following Jordan et al. [19], our objective is to maximise the log-likelihood
function of the observed data, $\ln p(y \mid \theta)$, with respect to the parameters $\theta$. We
now write

$$
\begin{aligned}
\ln p(y \mid \theta) &= \ln \sum_{x} p(y, x \mid \theta, \beta) \\
&\geq \sum_{x} q(x \mid m) \ln \frac{p(y, x \mid \theta, \beta)}{q(x \mid m)} \\
&= \sum_{x} q(x \mid m) \ln p(y \mid \theta) - \sum_{x} q(x \mid m) \ln \frac{q(x \mid m)}{p(x \mid \theta, \beta, y)},
\end{aligned}
\tag{4}
$$

where we have used Jensen’s inequality. $q(x \mid m)$ is an arbitrary, non-singular
probability distribution, parameterised by the variational parameter m. From
(4), which apart from a change of sign corresponds to the variational free energy
from statistical physics, we see directly that the difference between the two sides
is the KL divergence (3) between $q(x \mid m)$ and $p(x \mid \theta, \beta, y)$. As shown by Neal
and Hinton [26], maximisation of (4) with respect to m corresponds to the
E-step of the EM-algorithm, whenever $q(x \mid m)$ is rich enough to model
$p(x \mid \theta, \beta, y)$ exactly. When, on the other hand, computational considerations
force us to resort to simpler distribution models, we can still be certain that the
resulting algorithm will increase the lower bound on the log-likelihood function,
unless it is already at a (local) maximum.
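
The relation between the bound in (4) and the KL divergence (3) can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions: the hidden variable takes only three values, the joint probabilities are made-up numbers, and all function names are hypothetical.

```python
import math


def log_evidence(joint):
    """ln p(y) = ln sum_x p(y, x), for a list of joint probabilities p(y, x)."""
    return math.log(sum(joint))


def lower_bound(q, joint):
    """sum_x q(x) ln(p(y, x) / q(x)); terms with q(x) = 0 contribute nothing."""
    return sum(qx * math.log(px / qx) for qx, px in zip(q, joint) if qx > 0)


def kl_divergence(q, p):
    """KL(q || p) = sum_x q(x) ln(q(x) / p(x))."""
    return sum(qx * math.log(qx / px) for qx, px in zip(q, p) if qx > 0)


# Made-up joint probabilities p(y, x) for a fixed observation y.
joint = [0.10, 0.25, 0.05]
posterior = [v / sum(joint) for v in joint]  # exact p(x | y)
q = [0.5, 0.3, 0.2]                          # an arbitrary approximating distribution

gap = log_evidence(joint) - lower_bound(q, joint)
```

Here `gap` coincides with `kl_divergence(q, posterior)` up to rounding, and the bound becomes tight when q is set to the exact posterior, which is the situation of the standard E-step.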

3 An Application to fMRI Data 

Functional magnetic resonance imaging (fMRI) attempts to detect brain activity
by localised, non-invasive measurements of the change in blood oxygenation, the
so-called BOLD contrast [27]. This is sensitive to the relative local concentrations
of oxygenated hemoglobin (HbO$_2$) vs. deoxyhemoglobin and provides an
indirect measure of the brain’s neuronal activity.

Measurements, in the form of a time-series of images, are collected under 
controlled conditions, where subjects are performing specific tasks, prompted 
by some stimulus (e.g. deciding whether a read-out sentence is grammatically
correct or not, performing arithmetic calculations, looking at changing scenes, etc.).
We only consider fMRI experiments with a single trial (or ‘event-related’) design,
which consist of a series of individual trials. Each trial consists of one repetition of
the task, followed by a period of rest during which the subject is assumed to be 
inactive. 

When we want to model the fMRI data generating process, there are neu- 
rophysiological factors we must take into consideration. The local change in 
blood oxygenation as an effect of increased neuronal activity, which is called the 
hemodynamic response (HR), is delayed by 2-6 seconds from stimulus onset and 
dispersed by 2-3.5 seconds. This delay and dispersion vary between subjects, 
experimental conditions, etc. By contrast, the stimulus that subjects are exposed to
during data collection, which is assumed to trigger the task-related activity, is
normally treated as being discrete. Often it is modelled as a binary (‘box-car’)
function, i.e. the stimulus is either present or not present.

Traditional analysis of fMRI data essentially amounts to locating so-called
activated pixels, where the observed measurements show significant correlation
with a function representing the task. There are different strategies for altering
this function to account for the HR, ranging from simply shifting it in
time [2] to convolving it with a HR model function [13,23,29]. The correlation is
computed for each pixel individually and the correlation scores are transformed
into Z-scores [1]. The resulting image of Z-scores, called a Z-map, is then
thresholded at a level chosen so that the probability of wrongly classifying a pixel
as being activated is suitably low (see e.g. [13]).
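
As a rough illustration of the correlation step in this traditional analysis, the following Python sketch correlates a single pixel time series with a box-car function, with and without a time shift. The time series is made-up data, the function names are hypothetical, and the subsequent transformation into Z-scores and the thresholding are not shown.

```python
import math


def boxcar(n_samples, onset, duration, shift=0):
    """Binary stimulus function, optionally delayed by `shift` samples."""
    start = onset + shift
    return [1.0 if start <= t < start + duration else 0.0 for t in range(n_samples)]


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up measurements for one pixel: the response peaks about two samples
# after the stimulus, which is present during samples 1 to 3.
series = [0.1, 0.0, 0.2, 0.9, 1.1, 1.0, 0.2, 0.1, 0.0, 0.1, 0.0, 0.1]

unshifted = pearson(series, boxcar(12, onset=1, duration=3))
shifted = pearson(series, boxcar(12, onset=1, duration=3, shift=2))
```

For this delayed response, `shifted` is considerably larger than `unshifted`, which is the motivation for adjusting the task function to account for the HR.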
We propose to model an fMR image, by which we mean a set of pixels on a
regular lattice with associated time-series of measurements, as a MRF. Each pixel
is assumed to belong to one out of K classes, with each class corresponding to a
parametric model function for the HR. The time series associated with each pixel
contains measurements collected at the corresponding location during a single
trial at times $t_1, \ldots, t_D$. By choosing a MRF model, we implicitly assume that the
spatial distribution of the classes will be locally smooth, so that neighbouring
pixels typically belong to the same class. Thus, images will consist of one or
more spatially homogeneous regions, each region associated with a parametric
HR model function. The model can also be seen as a K-component mixture
model [17] in the D-dimensional observable pixel space (i.e. in the temporal
domain of the HR), combined with a smoothing MRF prior distribution over
pixel classes (in the pixel lattice).

3.1 The Multi-level Logistic MRF Model 

To specify the prior pixel class distribution, we use the commonly applied
multi-level logistic (MLL) model [12,14], where we specify neighbourhoods such that
each pixel only depends on its nearest neighbours (distance equal to one in the
lattice of pixels). We represent the MRF variable associated with pixel n as a
K-dimensional binary vector, $x_n$. Pixel n belongs to class k if and only if the
kth element of $x_n$, denoted $x_{nk}$, equals 1 and all other elements equal 0. This
model contains the binary MRF as a special case (K = 2).

We then define the energy function, 

$$E^{\mathrm{int}}(x, \beta) = \frac{\beta}{2} \sum_{n}^{N} \sum_{x_m \in \mathcal{N}_n} x_m^{\mathsf{T}} U x_n, \tag{5}$$

where U is a K x K matrix with elements along its diagonal equal to -1 and
all other elements equal to 1. The scalar $\beta$ plays the role of a scale parameter for
the prior. As $\beta$ increases, so does the cost for neighbouring pixels from different
classes, which in effect forces a smoother image.
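
The effect of this prior can be illustrated by evaluating the internal energy for a few small label images. The Python sketch below assumes a 4-neighbourhood with free boundaries (border pixels simply have fewer neighbours) and stores each pixel as an integer class label rather than as a binary vector; both choices, like the function name, are assumptions made for the illustration.

```python
def mll_internal_energy(labels, beta=1.0):
    """(beta / 2) * sum over pixels n and neighbours m of U[label_m][label_n].

    U is -1 on the diagonal and +1 elsewhere, so an equal neighbouring pair
    contributes -1 and an unequal pair +1. Every pair is visited twice (once
    from each side), which the factor 1/2 compensates for.
    """
    rows, cols = len(labels), len(labels[0])
    total = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += -1 if labels[rr][cc] == labels[r][c] else 1
    return 0.5 * beta * total


uniform = [[0, 0], [0, 0]]
stripes = [[0, 1], [0, 1]]
checker = [[0, 1], [1, 0]]
energies = [mll_internal_energy(img) for img in (uniform, stripes, checker)]
```

The uniform image has the lowest energy and the checkerboard the highest, so under the Gibbs distribution (1) smooth label images are the most probable.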

3.2 Modelling the Hemodynamic Response 

Several model functions for the HR have been proposed [5,13,23,29]; we choose 
to model the HR using a Gaussian function [22], such that 

$$h(t) = \eta \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right) + o, \tag{6}$$

where

$\mu$ denotes the lag, i.e. the time from the onset of the stimulus to the peak of the
HR,
$\sigma$ denotes the dispersion, which reflects the rise and decay time,
$\eta$ denotes the gain, or amplitude, of the response, and finally
o denotes an offset that defines the minimum level for the HR model function,
relative to some baseline level.

For numerical convenience, $\sigma$ and $\eta$ are expressed using auxiliary variables,
$z_\sigma$ and $z_\eta$, so that

$$\sigma = \exp(z_\sigma) \quad \text{and} \quad \eta = \exp(z_\eta). \tag{7}$$

This will ensure that $\sigma$ and $\eta$ are always positive. In the case of $\eta$, this is
actually a simplification, since there is evidence for localised deactivation in
response to stimuli. We denote the parameters $\theta = [\theta_1, \ldots, \theta_K]$, where
$\theta_k = [\mu_k, z_{\sigma k}, z_{\eta k}, o_k]$.
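
A minimal Python sketch of the HR model function and its log-parameterisation is given below. It assumes the common unnormalised Gaussian form, gain * exp(-(t - lag)^2 / (2 * dispersion^2)) + offset; this exact form, the function names and the example values are assumptions made for the illustration.

```python
import math


def hr_model(t, mu, z_sigma, z_eta, offset):
    """Gaussian HR model with lag mu, dispersion exp(z_sigma), gain exp(z_eta) and offset."""
    sigma = math.exp(z_sigma)  # dispersion, positive by construction
    eta = math.exp(z_eta)      # gain, positive by construction
    return eta * math.exp(-((t - mu) ** 2) / (2.0 * sigma ** 2)) + offset


def hr_vector(times, theta_k):
    """Evaluate the HR model of one class at the sampling times."""
    mu, z_sigma, z_eta, offset = theta_k
    return [hr_model(t, mu, z_sigma, z_eta, offset) for t in times]


# Example values: lag of 6 time steps, dispersion 2, gain 4, zero offset.
theta_example = (6.0, math.log(2.0), math.log(4.0), 0.0)
h_example = hr_vector(range(12), theta_example)
```

Because the optimiser works on the unconstrained auxiliary variables, no explicit positivity constraints are needed for the dispersion and the gain.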

We combine the K HR model functions with an isotropic Gaussian noise
process with variance $\sigma^2$ common to all HR model functions (this noise
parameter is distinct from the dispersion of each HR function). For a pixel n,
which belongs to class k, the probability distribution for the D-dimensional
observable trial vector $y_n$ can then be written as

$$p(y_n \mid \theta_k, \sigma) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\left(-\frac{1}{2\sigma^2} \|y_n - h_k\|^2\right), \tag{8}$$

where $h_k$ is a D-dimensional vector corresponding to the HR model function,
computed from (6) and (7) at times $t_1, \ldots, t_D$ using the parameter vector $\theta_k$.
Note that this model implicitly assumes that any two random vectors $y_n$ and
$y_m$, $n \neq m$, are independent given the classes of the corresponding pixels.

From the negative logarithm of (8), we can derive the external energy for the 
MRF model 

$$E^{\mathrm{ext}}(x, y, \theta, \sigma) = \frac{1}{2\sigma^2} \sum_{n,k}^{N,K} \|y_n - h_k\|^2 x_{nk}, \tag{9}$$

where $\sum_{n,k}^{N,K} = \sum_{n}^{N} \sum_{k}^{K}$; this abbreviated notation will be used throughout the
rest of this paper. Recall that $x_{nk}$ is 1 if and only if pixel n belongs to class k
and 0 otherwise. The term arising from the normalisation factor, $(2\pi\sigma^2)^{D/2}$, has
been dropped as it does not depend on x.

A Prior for the HR Parameters. Given our knowledge about neurology 
and fMRI in general and the experimental design in particular, we have certain 
a-priori beliefs about what can be considered reasonable values of the HR
parameters. We can express these beliefs by specifying a prior distribution over the
HR parameters. Here, we choose a simple independent Gaussian distribution,

$$p(\theta) = (2\pi)^{-2K} |V_\theta|^{-1/2} \exp\left(-\frac{1}{2} (\theta - \bar{\theta})^{\mathsf{T}} V_\theta^{-1} (\theta - \bar{\theta})\right), \tag{10}$$

where $\bar{\theta}$ is a 4K-element vector containing the expected values for $\mu_k$, $z_{\sigma k}$,
$z_{\eta k}$ and $o_k$, k = 1, ..., K, and $V_\theta$ is a diagonal covariance matrix with the
corresponding variances along its diagonal.

3.3 Mean Field Equations for the Multi-level Logistic Model

Combining (9) with (5), we get the energy function 

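$$E = \sum_{n,k}^{N,K} E_{nk} x_{nk}, \tag{11}$$

where

$$E_{nk} = \frac{1}{2\sigma^2} \|y_n - h_k\|^2 + \frac{\beta}{2} \sum_{x_m \in \mathcal{N}_n} x_m^{\mathsf{T}} U_k, \tag{12}$$

where in turn $U_k$ denotes the kth column of U. From (1), we can write the
corresponding distribution over the MRF as

$$p(x \mid y, \theta, \sigma, \beta) = \frac{1}{Z} \exp\left(-\sum_{n,k}^{N,K} E_{nk} x_{nk}\right). \tag{13}$$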



For the mean field approximation, we choose q to be a simple independent
multinomial distribution, where each lattice variable, $x_n$, has its own variational
parameter,

$$q(x \mid m) = \prod_{n,k}^{N,K} m_{nk}^{x_{nk}}. \tag{14}$$

$m_n$ is a K-dimensional vector whose elements are all positive and sum to 1; the
kth element of $m_n$, $m_{nk}$, represents the probability that pixel n belongs to class
k. m denotes the concatenation of $m_n$, n = 1, ..., N.

Substituting (13) and (14) into the quotient in (3), performing some elementary
algebra and then averaging with respect to $q(x \mid m)$, we get

$$\mathrm{KL}(q \,\|\, p) = \sum_{n,k}^{N,K} m_{nk} \left( \ln m_{nk} + \bar{E}_{nk} \right) + \ln Z,$$

where $\bar{E}_{nk}$ is identical to $E_{nk}$ in (12), except that $x_m$ has been replaced by $m_m$.
Taking the derivative of this with respect to $m_{nk}$, using Lagrange multipliers,
$\zeta_n$, to ensure that $\sum_k m_{nk} = 1$ for all n (see e.g. [10]), we get

$$\ln m_{nk} + 1 + \tilde{E}_{nk} + \zeta_n,$$

where $\tilde{E}_{nk}$ is identical to $\bar{E}_{nk}$ except that the factor $\beta/2$ has been replaced by
$\beta$ as a consequence of neighbourhood symmetry. Setting these to zero, we can
solve for $\zeta_n$, using that $\sum_k m_{nk} = 1$, and subsequently for $m_{nk}$, yielding

$$m_{nk} = \frac{\exp(-\tilde{E}_{nk})}{\sum_{k'}^{K} \exp(-\tilde{E}_{nk'})}, \tag{15}$$

which are the mean field equations for the MLL model; these can be solved
iteratively for a fixed point solution. An alternative derivation, drawing on analogies
to statistical mechanics, is given by Zhang [36].

At the moment, it is not established under which conditions these equations
converge; Zhang [37] analysed the convergence for an Ising model equivalent
to a binary MRF, which was found to converge under certain conditions. In
practice, convergence does not appear to be a problem - the parameter m settles 
rapidly, and failure to reach absolute convergence simply means that our bound 
on the log-likelihood will be less tight. 
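
The fixed point iteration of the mean field equations (15) can be sketched in a few lines of Python. The sketch makes several assumptions that are not prescribed by the text: pixels are updated in place, one after another, in raster order; the neighbourhood is the 4-neighbourhood with free boundaries; the data term of each pixel and class (the squared distance to the HR vector divided by twice the noise variance) is supplied precomputed as `data_energy`; and all names are hypothetical.

```python
import math


def neighbours(r, c, rows, cols):
    """Yield the lattice coordinates at distance one from (r, c)."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            yield rr, cc


def mean_field_sweep(m, data_energy, beta):
    """One in-place sweep of the normalised-exponential update over all pixels."""
    rows, cols, n_classes = len(m), len(m[0]), len(m[0][0])
    for r in range(rows):
        for c in range(cols):
            energies = []
            for k in range(n_classes):
                # Sum over neighbours of m_m^T U_k, with U = -1 on the diagonal, +1 elsewhere.
                coupling = sum(
                    sum(m[rr][cc][j] * (-1.0 if j == k else 1.0) for j in range(n_classes))
                    for rr, cc in neighbours(r, c, rows, cols)
                )
                energies.append(data_energy[r][c][k] + beta * coupling)
            e_min = min(energies)  # shift before exponentiating, for numerical stability
            weights = [math.exp(-(e - e_min)) for e in energies]
            total = sum(weights)
            m[r][c] = [w / total for w in weights]
    return m


def run_mean_field(data_energy, beta=1.0, n_sweeps=20):
    """Start from uniform class probabilities and apply a fixed number of sweeps."""
    rows, cols = len(data_energy), len(data_energy[0])
    n_classes = len(data_energy[0][0])
    m = [[[1.0 / n_classes] * n_classes for _ in range(cols)] for _ in range(rows)]
    for _ in range(n_sweeps):
        mean_field_sweep(m, data_energy, beta)
    return m


# A 1 x 3 image with two classes: the outer pixels favour class 0, while the
# data are uninformative for the middle pixel.
example_energy = [[[0.0, 2.0], [1.0, 1.0], [0.0, 2.0]]]
example_m = run_mean_field(example_energy, beta=1.0)
```

In this toy example the middle pixel is pulled towards class 0 by its neighbours, whereas with beta set to zero it remains undecided.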

3.4 Parameter Estimation 

Until now, we have implicitly assumed that all the parameters are known.
This is typically not the case, but given the theory in Sect. 2, we can use the
EM-algorithm to estimate parameters of interest. In the E-step, we compute
the mean field approximation (15) to the posterior distribution over the MRF
variables. In the following M-step, we maximise the resulting expected complete
log-likelihood with respect to the parameters.

Here we restrict ourselves to maximisation with respect to $\theta$ and $\sigma$. The
hyperparameters for the prior distribution over HR parameters are set using
knowledge about the experimental design and general HR characteristics. $\beta$ is
set by experimentation; experience so far suggests that the final result is not very
sensitive to the exact choice of $\beta$, which was also reported in [36].

We derive our objective function from a hypothetical log-likelihood function,
where the class labels, x, are known. As we assume that the observations at
different pixels are independent given the corresponding class labels, we get the
penalised log-likelihood function from (8) and (10) as

$$\ell(\theta, \sigma) = -ND \ln \sigma - \frac{1}{2\sigma^2} \sum_{n,k}^{N,K} x_{nk} \|y_n - h_k\|^2 - \frac{1}{2} (\theta - \bar{\theta})^{\mathsf{T}} V_\theta^{-1} (\theta - \bar{\theta}),$$

where we have omitted terms which do not depend on $\theta$ or $\sigma$. We obtain the
corresponding expected complete penalised log-likelihood function simply by
replacing the $x_{nk}$ by their corresponding mean field expectations, $m_{nk}$, computed
from (15).

Maximisation of the resulting objective function with respect to $\theta$ is done
by numerical optimisation¹. For $\sigma$, we get an update formula in closed form,

$$\sigma^2 = \frac{1}{ND} \sum_{n,k}^{N,K} m_{nk} \|y_n - h_k\|^2,$$

where $h_k$ are computed using the updated parameters $\theta_k$.
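
The closed-form noise update amounts to a responsibility-weighted mean squared residual: the squared distance between each trial vector and each HR vector is weighted by the corresponding mean field probability, summed, and divided by ND. A minimal Python sketch follows; the function names and the nested-list data layout are assumptions made for the illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def update_sigma(y, h, m):
    """Noise standard deviation from trial vectors y[n], HR vectors h[k] and weights m[n][k]."""
    n_pixels, n_dims = len(y), len(y[0])
    total = sum(
        m[n][k] * squared_distance(y[n], h[k])
        for n in range(n_pixels)
        for k in range(len(h))
    )
    return math.sqrt(total / (n_pixels * n_dims))


# Two pixels, two time points, two classes (made-up numbers).
y_example = [[1.0, 1.0], [3.0, 3.0]]
h_example = [[0.0, 0.0], [2.0, 2.0]]
sigma_hard = update_sigma(y_example, h_example, [[1.0, 0.0], [0.0, 1.0]])
sigma_soft = update_sigma(y_example, h_example, [[0.5, 0.5], [0.5, 0.5]])
```

With undecided class probabilities, residuals to poorly fitting HR vectors also enter the sum, so `sigma_soft` is larger than `sigma_hard`.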



¹ We use the function fsolve from the software package Octave [11] for this purpose.

Mean Field Annealing. The parameter estimation problem is a fairly difficult
optimisation problem, and empirical evidence suggests that there are many poor
local optima where the optimisation procedure can get stuck. To reduce the risk
of this, we employ a simulated annealing scheme [6,20], multiplying (11) by an
inverse temperature factor (1/T), T > 1. Setting T > 1 will smooth the
(approximate) posterior distribution, which in effect will smooth out shallow local
optima. Thus, optimisation in the high-T regime (say T = 10 for data normalised
to zero mean and unit variance) will hopefully find a global optimum which then
can be tracked by re-estimating the parameters, $\theta$, as T is decreased to
1. Note that, as long as T > 1, $\sigma$ is kept fixed at 1; this annealing phase is
then followed by further optimisation where both $\theta$ and $\sigma$ are adapted. Annealing
approaches have been used successfully with MRF models for restoration
of e.g. fMRI images [9] and, in combination with mean field theory, anatomical
(‘non-functional’) magnetic resonance images [33].
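
The smoothing effect of the temperature can be seen by dividing a set of class energies by T before normalising. The following Python sketch is an illustration only; the function names are hypothetical, and the linear schedule with both end points included is an assumption about how a linear decrease might be implemented.

```python
import math


def tempered_probabilities(energies, temperature=1.0):
    """Normalised exp(-E / T) over the classes of a single pixel."""
    scaled = [e / temperature for e in energies]
    e_min = min(scaled)  # shift before exponentiating, for numerical stability
    weights = [math.exp(-(e - e_min)) for e in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def linear_schedule(t_start, t_end, n_steps):
    """Temperatures decreasing linearly from t_start to t_end, end points included."""
    if n_steps == 1:
        return [t_end]
    step = (t_start - t_end) / (n_steps - 1)
    return [t_start - i * step for i in range(n_steps)]


energies = [0.0, 2.0]
hot = tempered_probabilities(energies, temperature=10.0)
cold = tempered_probabilities(energies, temperature=1.0)
schedule = linear_schedule(10.0, 1.0, 20)
```

At T = 10 the two classes are almost equally probable, while at T = 1 the class with the lower energy clearly dominates; decreasing T gradually therefore moves from a smooth objective to the original one.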



3.5 Example 

In this example we use data from an fMRI experiment designed for investigating 
the neuronal correlates of sentence comprehension in the brain [25]. Subjects 
had to decide whether an aurally presented sentence contained a syntactical 
violation or not. The experiment employed a single trial design where each trial 
had a length of 24 seconds. Each trial started with a sentence being read out, 
which lasted 2.3-4.5 seconds. fMR images with a spatial resolution of 64 x 128
pixels were collected every 2 seconds, so the trial vector for each pixel consists 
of 12 measurements. In total, there were 76 trials, although the first 4 were not 
used. The data were pre-processed to correct for subject movements, remove 
baseline trends and filter out physiological and system noise [21]. 

For this example, data from the 72 selected trials were averaged, to improve 
the signal to noise ratio. The resulting averaged data were used to train a model 
with 2 HR functions and a constant ‘background’ function, intended to explain 
regions where no task-related activity occurs. This constant function has a single
parameter, namely its value, whose maximum likelihood update is the
time-averaged response at individual pixels, averaged over the posterior distribution
over pixel classes. The HR functions shared a common prior given in Table 1 and
$\beta$ was set to 1. The fitting procedure started with 20 iterations during which
T was decreased linearly from 10 to 1 and $\sigma$ was held fixed at 1.0, followed by
another 20 iterations where T = 1 and $\sigma$ was allowed to adapt.



Table 1. Hyperparameters for the prior distribution over HR parameters used
in the example described in Sect. 3.5. $\mu$ and $z_\sigma$ are measured in (log) time steps,
while $z_\eta$ and o are measured (log) relative to a normalised BOLD response

                   $\mu$     $z_\sigma$     $z_\eta$     o
Expected value     6       ln 2      ln 4      0
Variance           3       2/2^2     3/4^2     1

The left image in Fig. 1 shows the resulting segmentation of the functional
mask obtained from our model; pixels have been assigned to the class with high- 
est posterior class probability, computed using the mean field approximation, 
after the parameters had converged. Figure 2 shows the corresponding HR func- 
tions. Different types of filtering in the pre-processing [21] will cause marginal 
variations in these results, but the overall picture will remain the same. The 
right image of Fig. 1 shows a Z-map for the same data set, based on correlation 
with a shifted ‘box-car’ function, overlaid on the functional mask (see e.g. [13]); 
note that only pixels with positive activation are shown, as we do not consider
deactivated regions. 

As can be seen, the two HR functions take on different roles: one explains
regions with a relatively strong and slightly earlier response, and corresponds
roughly to pixels with strong activation (high Z-scores); the other explains a
weaker and slightly later response, and includes pixels with lower activation.



4 Discussion 

In this paper, we have reviewed the use of the EM-algorithm combined with mean 
field theory for parameter estimation from unlabelled data in MRF models, and 
the theoretical justification for this, based on results from the machine learning 
literature. Furthermore, we have shown an application of this procedure to the
analysis of fMRI data - a learning problem of an inherently unsupervised nature.

It should be clear that the overall framework is independent of the choice of 
HR model function, and thus other variants could be considered. Similarly, we 
could consider the use of a more elaborate noise model; Kruggel and von Cramon 
[22] discuss the use of an autoregressive (AR(1)) noise model in the spatial 
domain. 

A limitation of the work we have presented in this paper is the remaining
number of free parameters. $\beta$ is currently set by experimentation. Deriving a
method for updating $\beta$ in the light of observed data is difficult, since the
partition function for the MRF prior depends on $\beta$. Zhang [36] suggested using
a mean field approximation also for the partition function, but as pointed out
by Jordan et al. [19], this results in an update equation based on two different
bounds, which theoretically may decrease the log-likelihood of the data given
$\beta$. An alternative approach would be to use Monte Carlo sampling methods for
the parameter fitting. Such an approach would be computationally demanding;
a potential remedy could be to estimate $\theta$ and $\sigma$ using mean field theory, and
use Monte Carlo methods only for the updates of $\beta$, which need not be
updated every iteration of the EM-algorithm. The number of HR components, K,
is currently set by the user, based on empirical evidence, prior knowledge and
interpretability. It would clearly be desirable to be able to estimate K from the
data, but such estimation would face the same difficulties as the estimation of
$\beta$, since comparing models with different values for K requires computing the
corresponding partition functions. Nevertheless, methods based on minimum
description length theory [30] or maximum entropy principles [34], combined with
approximate methods for computing the partition function, could be considered.

Fig. 1. A segmentation obtained using the proposed method (left) and a
corresponding correlation-based Z-map (right), for the data described in Sect. 3.5. In the
left image, pixels in the functional mask have been classified according to their
maximum posterior class probabilities; the corresponding HR model functions
are plotted in Fig. 2; the dominating white class corresponds to the background
function. In the Z-map, which is overlaid on the functional mask, pixels have
been shaded according to their Z-score, where brighter pixels indicate higher
activation

Fig. 2. The HR functions corresponding to the segmentation shown in Fig. 1.
The solid line corresponds to the light grey pixels while the dashed line
corresponds to dark grey pixels. The error bars correspond to 1 standard deviation
of the data from the mean given by the curve

The idea of deriving mean field equations by minimising the KL-divergence
given a choice of approximating distribution raises the question whether other
approximating distributions can be found that give a tighter bound and
remain computationally tractable. Jordan et al. [19] give several such
examples in the context of learning in graphical models, some of which could
potentially be applied to MRF models.

An obvious limitation of the mean field approximation discussed in Sect. 
3.3 is that it is unimodal, i.e. the spatial distribution of pixel class labels is 
centred around a single configuration. This might be a reasonable approximation
when modelling averaged data from a single experiment, as in Sect. 3.5,
but if we want to investigate between-trial variance within one experiment or
even the (dis)similarities between trials from different experiments, it is clearly
insufficient. Jaakkola and Jordan [16] proposed the use of a mixture of fully fac- 
torised mean field distributions, and Bishop et al. [4] empirically demonstrated 
the usefulness of this approach in the context of sigmoid belief networks. A fu- 
ture direction of research will be to extend the approach presented in this paper 
to the use of such mixture distributions, and investigate the usefulness of this 
for the purpose of fMRI data modelling. 

For fMRI data, it is also natural to consider modelling structure in the time 
domain, since an experiment consists of a sequence of images corresponding to
the sequence of trials. Theory for such a model could be built on the existing 
theory for hidden Markov models (HMM) [28], which has recently been subject to 
substantial development in the context of graphical models and machine learning 
[15,19,32]. This strand of research will be pursued as an extension of the mixture 
model discussed in the previous paragraph.